Abstract

We introduce a general framework for treating channels with memory and feedback. 
First, we generalize Massey's concept of directed information [23] and use it to characterize
the feedback capacity of general channels. Second, we present coding results for Markov 
channels. This requires determining appropriate sufficient statistics at the encoder and 
decoder. Third, a dynamic programming framework for computing the capacity of Markov 
channels is presented. Fourth, it is shown that the average cost optimality equation (ACOE) 
can be viewed as an implicit single-letter characterization of the capacity. Fifth, scenarios 
with simple sufficient statistics are described.

1 Introduction

This paper presents a general framework for proving coding theorems for channels with 
memory and feedback. The problem of optimal channel coding goes back to Shannon's
original work [?]. The channel coding problem with feedback goes back to early work by
Shannon, Dobrushin, Wolfowitz, and others [27], [11], [37]. Because of increased demand for
wireless communication and networked systems there is a renewed interest in this problem. 
Feedback can increase the capacity of a noisy channel, decrease the complexity of the encoder 
and decoder, and reduce latency. 

Recently Verdú and Han presented a very general formulation of the channel coding
problem without feedback [33]. Specifically, they provided a coding theorem for finite
alphabet channels with arbitrary memory. They worked directly with the information density
and provided a Feinstein-like lemma for the converse result. Here we generalize that
formulation to the case of channels with feedback. In this case we require the use of
code-functions as opposed to codewords. A code-function maps a message and the channel
feedback information into a channel input symbol. Shannon introduced the use of code-functions,
which he called strategies, in his work on transmitter side information [?]. Code-functions are
also sometimes called codetrees [?].

We convert the channel coding problem with feedback into a new channel coding prob- 
lem without feedback. The channel inputs in this new channel are code-functions. Un- 
fortunately the space of code-functions can be quite complicated to work with. We show 
that we can work directly with the original space of channel inputs by making explicit 
the relationship between code-function distributions and channel input distributions. This 
relationship allows us to convert a mutual information optimization problem over code- 
function distributions into a directed information optimization problem over channel input 
distributions. 

The concept of directed information was introduced by Massey, who attributes it to
Marko [23], [22]. Our work, in part, builds on the work of Kramer [19], [20]. He used the
concept of directed information to prove capacity theorems for general discrete memoryless 
networks. These networks include the memoryless two-way channel and the memoryless 
multiple access channel. Here we examine single-user channels with memory and feedback. 

There is a long history of work regarding Markov channels and feedback. Here we
describe a few connections to that literature. Mushkin and Bar-David [24] examined the
Gilbert-Elliott channel. Goldsmith and Varaiya [15] examine non-ISI Markov channels
without feedback. For the case of i.i.d. inputs and symmetric channels they introduce sufficient
statistics that lead to a single-letter formula. We identify the appropriate sufficient
statistics when feedback is available. In some sense the Markov channel with feedback problem is
easier than the Markov channel without feedback problem because in the feedback case the
decoder's information pattern is nested in the encoder's information pattern. In this
paper we do not treat noisy feedback. Viswanathan [?], Caire and Shamai [?], and Das and
Narayan [?] all examine different classes of channels with memory and side-information at
the transmitter and receiver. We present a general framework for treating Markov channels
with ISI and feedback.

Many authors consider conditions that ensure the Markov channel is information
stable [?]. For example, Cover and Pombra [7] show that Gaussian channels with feedback
are always information stable. In addition, some authors consider conditions that ensure
the Markov channel is indecomposable [14], [?]. In our work it is shown that solutions to
the associated average cost optimality equation (ACOE) imply information stability. In 
addition the sufficient condition provided here for the existence of a solution to the ACOE 
implies a strong mixing property of the underlying Markov channel in the same way that 
indecomposability does. 

We consider Markov channels with finite state, channel input, and channel output
alphabets. But with the introduction of appropriate sufficient statistics we quickly find ourselves
working with Markov channels over general alphabets and states. As shown by Csiszár [?],
for example, treating general alphabets involves many technical issues that do not arise in
the finite alphabet case.

Tatikonda first introduced the dynamic programming approach to computing directed 
information in his PhD thesis [55]. Yang, Kavcic, and Tatikonda have examined the case 
of finite state machine Markov channels in [?], [39]. Here we present a general framework
that treats many Markov channels including finite state machine Markov channels. 

In summary, the main contributions of this paper are: 

(1) We generalize Massey's concept of directed information [23] and use it to characterize
the feedback capacity of general channels. 

(2) We present coding results for Markov channels. This requires determining appropriate 
sufficient statistics at the encoder and decoder. 

(3) A dynamic programming framework for computing the capacity of Markov channels 
is presented. 

(4) It is shown that the average cost optimality equation (ACOE) can be viewed as an 
implicit single-letter characterization of the capacity. 

(5) Scenarios with simple sufficient statistics are described.

Preliminary versions of this work have appeared in [?], [30], [?], [?].

2 Feedback and Causality 

Here we discuss some of the subtleties of feedback and causality inherent in the feedback 
capacity problem. The channel at time $t$ is modelled as a stochastic kernel $p(db_t \mid a^t, b^{t-1})$,
where $a^t = (a_1, \ldots, a_t)$ and $b^0 = \emptyset$. See figure 1. The channel output is fed back to
the encoder with delay one. At time $t$ the encoder takes the message and the past channel
output symbols $B_1, \ldots, B_{t-1}$ and produces a channel input symbol $A_t$. At time $T$ the decoder
takes all the channel output symbols, $B_1, \ldots, B_T$, and produces the decoded message. Hence
the time ordering of the variables is

$$\text{Message},\ A_1,\ B_1,\ A_2,\ B_2,\ \ldots,\ A_T,\ B_T,\ \text{Decoded Message}. \tag{1}$$

Figure 1: Interconnection

When there is no feedback, under suitable conditions, $\sup_{P(da^T)} I(A^T; B^T)$ characterizes
the maximum number of messages one can send with small probability of decoding error.
Our goal in this paper is to generalize this to the case of feedback. To that end we now 
mention some subtleties that will guide our approach. 

One should not supremize the mutual information $I(A^T; B^T)$ over the stochastic kernel
$p(da^T \mid b^T)$. We can factor $p(da^T \mid b^T) = \otimes_{t=1}^{T} p(da_t \mid a^{t-1}, b^T)$. This states that at time
$t$ the channel input symbol $A_t$ is allowed to depend on the future channel output symbols
$B_t^T$. This violates the causality implicit in our encoder description. In fact $p(da^T \mid b^T)$ is
the posterior probability used by the decoder to decode the message at time $T$. Instead,
as we will show, one should supremize the mutual information over the directed stochastic
kernel: $\vec{p}(da^T \mid b^T) = \otimes_{t=1}^{T} p(da_t \mid a^{t-1}, b^{t-1})$. See definition 4.1.

One should not use the stochastic kernel $p(db^T \mid a^T)$ as a model of the channel when
there is feedback. To compute the mutual information we need to work with the joint
measure $P(da^T, db^T)$. In general it is not possible to find a joint measure consistent with
the stochastic kernels $p(da^T \mid b^T)$ and $p(db^T \mid a^T)$. Instead, as we will show, the
appropriate model for the channel when there is feedback is a sequence of stochastic kernels:
$\{p(db_t \mid a^t, b^{t-1})\}_{t=1}^{T}$. See section 3.

One should not use the mutual information $I(A^T; B^T)$ when there is feedback. When
there is feedback the conditional probabilities $P(db_t \mid a^T, b^{t-1}) \neq P(db_t \mid a^t, b^{t-1})$ almost
surely under $P(da^T, db^T)$. Even though $A_{t+1}$ occurs after $B_t$ it still has a probabilistic
influence on it. This is because under feedback $A_{t+1}$ is influenced by the past channel
output $B_t$. To quote Massey, "statistical dependence, unlike causality, has no inherent
directivity." The mutual information decomposes as $I(A^T; B^T) = \sum_{t=1}^{T} I(A^T; B_t \mid B^{t-1})$.
The information transmitted to the receiver at time $t$, given by $I(A^T; B_t \mid B^{t-1})$, depends
on the future $A_{t+1}^T$. Instead, as we will show, we should use the directed information:
$I(A^T \to B^T)$. See definition 4.2.

3 Channels with Feedback



In this section we formulate the feedback channel coding problem. We first introduce some 
notation. Let $p(dy \mid x)$ represent a stochastic kernel from the measurable space $\mathcal{X}$ to $\mathcal{Y}$.
We say a stochastic kernel is continuous if for all continuous bounded functions $v$ on $\mathcal{Y}$
the function $\int v(y)\, p(dy \mid x)$ is a continuous and bounded function on $\mathcal{X}$. See the appendix for
definitions and properties of stochastic kernels.

Given a joint measure $P(dx, dy)$ we will use $P(Y = y \mid X = x)$ (or just $P(y \mid x)$)
to represent the conditional probability (when it exists). In general, lower case letters
$p, q, r, \ldots$ will be used for stochastic kernels and upper case letters $P, Q, R, \ldots$ will be used for
joint measures or conditional probabilities. Let $\mathcal{P}(\mathcal{X})$ represent the space of all probability
measures on $\mathcal{X}$ endowed with the topology of weak convergence.

Capital letters, $A, B, X, Y, Z$, will represent random variables and lower case letters,
$a, b, x, y, z$, will represent particular realizations. For the stochastic kernel $p(dy \mid x)$,
$p(y \mid x)$ is a number. Given a joint measure $P_{X,Y}(dx, dy) = P_X(dx) \otimes p(dy \mid x)$,
$p(y \mid X)$ is a random variable taking value $p(y \mid x)$ with probability $P_X(x)$, $p(Y \mid X)$
is a random variable taking value $p(y \mid x)$ with probability $P_{X,Y}(x, y)$, and $p(dy \mid X)$ is a
random measure-valued element taking value $p(dy \mid x)$ with probability $P_X(x)$. Finally let
the notation $X - Y - Z$ denote that the random elements $X, Y, Z$ form a Markov chain.

We are now ready to formulate the feedback channel coding problem. Let $\{A_t\}_{t=1}^{T}$ be
random elements in the finite$^1$ set $\mathcal{A}$ with the power-set $\sigma$-algebra. These represent the
channel inputs. Similarly let $\{B_t\}_{t=1}^{T}$ be random elements in the finite set $\mathcal{B}$ with the
power-set $\sigma$-algebra. These represent the channel outputs. Let $\mathcal{A}^T$ and $\mathcal{B}^T$ represent the
$T$-fold product spaces with appropriate product $\sigma$-algebras (where $T$ may be infinity). We
use "log" to represent logarithm base 2.

A channel is a family of stochastic kernels $\{p(db_t \mid a^t, b^{t-1})\}_{t=1}^{T}$. These channels are
nonanticipative with respect to time-ordering (1) because the conditioning includes only
$a^t, b^{t-1}$.

Let $\mathcal{F}_t$ be the set of all measurable maps $f_t : \mathcal{B}^{t-1} \to \mathcal{A}$ taking $b^{t-1} \mapsto a_t$. Endow
$\mathcal{F}_t$ with the power-set $\sigma$-algebra. Let $\mathcal{F}^T = \prod_{t=1}^{T} \mathcal{F}_t$ denote the Cartesian product
endowed with the product $\sigma$-algebra. Note that since $\mathcal{A}$ and $\mathcal{B}$ are finite the space of
code-functions is at most countable. A channel code-function is an element
$f^T = (f_1, \ldots, f_T) \in \mathcal{F}^T$. A distribution on $\mathcal{F}^T$ is given by a specification of a sequence
of code-function stochastic kernels $\{p(df_t \mid f^{t-1})\}_{t=1}^{T}$. Specifically
$P_{F^T}(df^T) = \otimes_{t=1}^{T} p(df_t \mid f^{t-1})$. We will use the notation
$f^t(b^{t-1}) = (f_1, f_2(b_1), \ldots, f_t(b^{t-1}))$.
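
To make the size of the code-function space concrete, the following Python sketch enumerates the maps $f_t : \mathcal{B}^{t-1} \to \mathcal{A}$ for binary alphabets. The dictionary representation and the names `code_functions`, `F1`, `F2`, `F3` are illustrative assumptions and not notation from the paper.

```python
from itertools import product

def code_functions(A, B, t):
    """All maps f_t : B^{t-1} -> A, each stored as a dict keyed by output histories."""
    histories = list(product(B, repeat=t - 1))   # all b^{t-1}; one empty history when t = 1
    return [dict(zip(histories, values))
            for values in product(A, repeat=len(histories))]

A = (0, 1)   # channel input alphabet (assumed binary for illustration)
B = (0, 1)   # channel output alphabet (assumed binary for illustration)

F1 = code_functions(A, B, 1)   # |A|^(|B|^0) maps
F2 = code_functions(A, B, 2)   # |A|^(|B|^1) maps
F3 = code_functions(A, B, 3)   # |A|^(|B|^2) maps

# Evaluate f^2(b^1) = (f_1, f_2(b_1)) for one code-function and one feedback symbol.
f1, f2 = F1[1], F2[2]
b1 = 1
inputs = (f1[()], f2[(b1,)])
print(len(F1), len(F2), len(F3), inputs)
```

The counts grow as $|\mathcal{A}|^{|\mathcal{B}|^{t-1}}$, which illustrates why optimizing directly over code-function distributions becomes unwieldy even for short horizons.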

A message set is a set $\mathcal{W} = \{1, \ldots, M\}$. Let the distribution $P_W$ on the message set $\mathcal{W}$
be the uniform distribution. A channel code is a set of $M$ channel code-functions denoted
by $f^T[w]$, $w \in \mathcal{W}$. For message $w$ at time $t$ with channel feedback $b^{t-1}$ the channel encoder
outputs $f_t[w](b^{t-1})$. A channel code without feedback is a set of $M$ channel codewords
denoted by $a^T[w]$, $w \in \mathcal{W}$. For message $w$ at time $t$ the channel encoder outputs $a_t[w]$
independent of the past channel outputs $b^{t-1}$.

$^1$ The methods in this paper can be generalized to channels with abstract alphabets.

A channel decoder is a map $g : \mathcal{B}^T \to \mathcal{W}$ taking $b^T \mapsto w$. The decoder waits until it
observes all the channel outputs before reconstructing the input message. The order of
events is shown in figure 1.

Definition 3.1 A $(T, M, \epsilon)$ channel code over time horizon $T$ consists of $M$ code-functions,
a channel decoder $g$, and an error probability satisfying: $\frac{1}{M} \sum_{w=1}^{M} \Pr(w \neq g(b^T) \mid w) \leq \epsilon$.

A (T, M, e) channel code without feedback is defined similarly with the restriction that we 
use M codewords. 

In what follows the superscripts "o" and "nfb" represent the words "operational" and "no
feedback." Following [?] we define:

Definition 3.2 $R$ is an $\epsilon$-achievable rate if, for every $\delta > 0$, there exist, for all sufficiently
large $T$, $(T, M, \epsilon)$ channel codes with rate $\frac{\log M}{T} > R - \delta$. The maximum $\epsilon$-achievable rate
is called the $\epsilon$-capacity and denoted $C_\epsilon^{o}$. The operational channel capacity is defined
as the maximal rate that is $\epsilon$-achievable for all $0 < \epsilon < 1$ and is denoted $C^{o}$. Analogous
definitions for $C_\epsilon^{o,nfb}$ and $C^{o,nfb}$ hold for the case without feedback.

Before continuing we quickly remark on some other formulations in the literature. Some
authors work with different sets of channels for each blocklength $T$. See for example
Dobrushin [12] and Verdú and Han [33]. In our context this would correspond to a different
sequence of channels for each $T$: $\{p^T(db_t \mid a^t, b^{t-1})\}_{t=1}^{T}$. Theorem 5.1 below will continue to
hold if we use channels of this form. But we do not need this level of generality to proceed
with the Markov channels described in section 6.

Note that in definition 3.2 we are seeking a single number $C^{o}$ that is an achievable
capacity for all sufficiently large $T$. Some authors instead, see [7] for example, seek a
sequence of numbers $\{C_T^{o}\}$ such that there exists a sequence of channel codes $\{(T, 2^{T C_T^{o}}, \epsilon_T)\}$
with $\epsilon_T \to 0$. It will turn out that for the time-invariant Markov channels described in section
6 the notion of capacity described in definition 3.2 is the appropriate one. We will further 
elaborate on this point in section 4.1 after we have reviewed the concept of information 
stability. 

3.1 Interconnection of Code-Functions to the Channel

Now we are ready to interconnect the pieces: channel, code-functions, encoder, and decoder.
We follow Dobrushin's program and define a joint measure over the variables of interest
that is consistent with the different components [12]. We will define a new channel without
feedback that connects the code-functions to the channel outputs. Corollary 3.1 below 
shows that we can connect the messages directly to the channel output symbols. 

Let $\{p(df_t \mid f^{t-1})\}_{t=1}^{T}$ be a sequence of code-function stochastic kernels with joint
measure $P_{F^T}(df^T) = \otimes_{t=1}^{T} p(df_t \mid f^{t-1})$ on $\mathcal{F}^T$. For example, $P_{F^T}$ may be a distribution that
places mass $1/M$ on each of $M$ different code-functions. Given a sequence of code-function
stochastic kernels $\{p(df_t \mid f^{t-1})\}_{t=1}^{T}$ and a channel $\{p(db_t \mid a^t, b^{t-1})\}_{t=1}^{T}$ we want to construct
a new channel that interconnects the random variables $F^T$ to the random variables $B^T$. We
use "Q" to denote the new joint measure $Q(df^T, da^T, db^T)$ that we will construct.

The following three reasonable properties should hold for our new channel.

Definition 3.3 A measure $Q(df^T, da^T, db^T)$ is said to be consistent with the code-function
stochastic kernels $\{p(df_t \mid f^{t-1})\}_{t=1}^{T}$ and the channel $\{p(db_t \mid a^t, b^{t-1})\}_{t=1}^{T}$ if for each $t$:

(1) There is no feedback to the code-functions in the new channel: The measure on $\mathcal{F}^T$
is chosen at time 0. Thus it cannot causally depend on the $A_t$'s and $B_t$'s. Thus for
each $t$ and all $f_t$, we have

$$Q(f_t \mid F^{t-1} = f^{t-1}, A^{t-1} = a^{t-1}, B^{t-1} = b^{t-1}) = p(f_t \mid f^{t-1})$$

for $Q$-almost all $f^{t-1}, a^{t-1}, b^{t-1}$.

(2) The channel input is a function of the past outputs: For each $t$, $A_t = F_t(B^{t-1})$ $Q$-a.s.
Said another way, for each $t$ and all $a_t$, we have

$$Q(a_t \mid F^{t} = f^{t}, A^{t-1} = a^{t-1}, B^{t-1} = b^{t-1}) = \delta_{\{f_t(b^{t-1})\}}(a_t)$$

for $Q$-almost all $f^{t}, a^{t-1}, b^{t-1}$. Here $\delta$ is the Dirac measure.

(3) The new channel preserves the properties of the underlying channel: For each $t$ and
all $b_t$, we have

$$Q(b_t \mid F^{t} = f^{t}, A^{t} = a^{t}, B^{t-1} = b^{t-1}) = p(b_t \mid a^{t}, b^{t-1})$$

for $Q$-almost all $f^{t}, a^{t}, b^{t-1}$.

The following lemma shows that there exists a unique consistent measure $Q$ and provides
the channel from $\mathcal{F}^T$ to $\mathcal{B}^T$.

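Lemma 3.1 Given a sequence of code-function stochastic kernels $\{p(df_t \mid f^{t-1})\}_{t=1}^{T}$ and a
channel $\{p(db_t \mid a^t, b^{t-1})\}_{t=1}^{T}$ there exists a unique consistent measure $Q(df^T, da^T, db^T)$ on
$\mathcal{F}^T \times \mathcal{A}^T \times \mathcal{B}^T$. Furthermore the channel from $\mathcal{F}^T$ to $\mathcal{B}^T$ for each $t$ and all $b_t$ is given by

$$Q(b_t \mid F^{t} = f^{t}, B^{t-1} = b^{t-1}) = p(b_t \mid f^{t}(b^{t-1}), b^{t-1}) \tag{2}$$

for $Q$-almost all $f^{t}, b^{t-1}$.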

Proof: Let $Q(df^T, da^T, db^T) = \otimes_{t=1}^{T}\, p(df_t \mid f^{t-1}) \otimes \delta_{\{f_t(b^{t-1})\}}(da_t) \otimes p(db_t \mid a^t, b^{t-1})$. For
finite $T$ this measure exists (see Appendix A.1). An application of the Ionescu-Tulcea
theorem shows that this measure exists for the $T = \infty$ case. Clearly this $Q$ is consistent
and by construction it is unique.

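Note that for each $(f^t, b^t)$ the joint measure can be decomposed as

$$\begin{aligned}
Q(f^t, b^t) &= \sum_{a^t} Q(f^t, a^t, b^t) \\
&= \sum_{a^t} \prod_{i=1}^{t} p(f_i \mid f^{i-1})\, \delta_{\{f_i(b^{i-1})\}}(a_i)\, p(b_i \mid a^i, b^{i-1}) \\
&= p(b_t \mid f^t(b^{t-1}), b^{t-1})\, p(f_t \mid f^{t-1}) \sum_{a^{t-1}} \prod_{i=1}^{t-1} p(f_i \mid f^{i-1})\, \delta_{\{f_i(b^{i-1})\}}(a_i)\, p(b_i \mid a^i, b^{i-1}) \\
&= p(b_t \mid f^t(b^{t-1}), b^{t-1})\, Q(f^t, b^{t-1}).
\end{aligned}$$

Thus we have shown equation (2). □

Hence for any sequence of code-function stochastic kernels $\{p(df_t \mid f^{t-1})\}_{t=1}^{T}$ the stochastic
kernel $p(db_t \mid f^t(b^{t-1}), b^{t-1})$ can be chosen as a version of the regular conditional
distribution $Q(db_t \mid F^t = f^t, B^{t-1} = b^{t-1})$. Thus the stochastic kernels $\{p(db_t \mid f^t(b^{t-1}), b^{t-1})\}_{t=1}^{T}$
can be viewed as the channel from $\mathcal{F}^T$ to $\mathcal{B}^T$. Note that the dependence is on $f^t(b^{t-1})$ and
not $f^t$. We will see in section 5 that this observation will greatly simplify computation.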
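
As a numerical sanity check of equation (2), the sketch below builds the consistent measure $Q$ for horizon $T = 2$ and binary alphabets, then compares $Q(b_2 \mid f^2, b_1)$ with the channel kernel evaluated at $f^2(b_1)$. The toy channel, the weights defining $P_{F^T}$, and all identifiers are hypothetical choices made only for illustration.

```python
from itertools import product
from collections import defaultdict

A = B = (0, 1)
T = 2

def channel(bt, a_hist, b_hist):
    """Toy p(b_t | a^t, b^{t-1}): a binary symmetric channel whose crossover
    probability depends on the previous output (hypothetical numbers)."""
    eps = 0.3 if (b_hist and b_hist[-1] == 1) else 0.1
    return 1 - eps if bt == a_hist[-1] else eps

# Code-functions: f_1 is a constant, f_2 maps b_1 to an input symbol.
F1 = [{(): a} for a in A]
F2 = [dict(zip([(0,), (1,)], v)) for v in product(A, repeat=2)]
codes = list(product(range(len(F1)), range(len(F2))))
weights = [1, 2, 3, 4, 4, 3, 2, 1]          # an arbitrary distribution on the 8 code-functions
P_F = {c: w / sum(weights) for c, w in zip(codes, weights)}

# Q(f^T, a^T, b^T) = prod_t p(f_t | f^{t-1}) * delta(a_t = f_t(b^{t-1})) * p(b_t | a^t, b^{t-1})
Q = {}
for (i, j), pf in P_F.items():
    for b in product(B, repeat=T):
        a = (F1[i][()], F2[j][(b[0],)])      # the Dirac factors select a single a^T
        pb = channel(b[0], a[:1], ()) * channel(b[1], a, b[:1])
        Q[((i, j), a, b)] = pf * pb

# Compare Q(b_2 | f^2, b_1) with p(b_2 | f^2(b_1), b_1).
num = defaultdict(float)   # Q(f^2, b_1, b_2)
den = defaultdict(float)   # Q(f^2, b_1)
for (f, a, b), q in Q.items():
    num[(f, b)] += q
    den[(f, b[0])] += q
max_gap = max(
    abs(num[(f, b)] / den[(f, b[0])]
        - channel(b[1], (F1[f[0]][()], F2[f[1]][(b[0],)]), b[:1]))
    for (f, b) in num
)
total = sum(Q.values())
print(total, max_gap)
```

In this toy model every code-function and every output sequence has positive probability, so the almost-sure qualifier plays no role.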

The almost sure qualifier in equation (2) comes from the fact that $Q(f^t, b^{t-1})$ may equal
zero for some $(f^t, b^{t-1})$. This can happen, for example, if either $f^t$ has zero probability
of appearing under $P_{F^T}(df^T)$ or $b^{t-1}$ has zero probability of appearing under the channel
$\{p(db_t \mid a^t, b^{t-1})\}$.

A distribution $P_W$ on $\mathcal{W}$ induces a measure $P_{F^T}$ on $\mathcal{F}^T$. Hence:

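Corollary 3.1 A distribution $P_W$ on $\mathcal{W}$, a channel code $\{f^T[w]\}_{w=1}^{M}$, and the channel
$\{p(db_t \mid a^t, b^{t-1})\}_{t=1}^{T}$ uniquely define a measure $Q(dw, da^T, db^T)$ on $\mathcal{W} \times \mathcal{A}^T \times \mathcal{B}^T$.
Furthermore the channel from $\mathcal{W}$ to $\mathcal{B}^T$ for each $t$ and all $b_t$ is given by

$$Q(b_t \mid W = w, B^{t-1} = b^{t-1}) = p(b_t \mid f^{t}[w](b^{t-1}), b^{t-1})$$

for $Q$-almost all $w, b^{t-1}$.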

4 Directed Information 

As discussed in section 2 the traditional mutual information is insufficient for dealing with 
channels with feedback. Here we generalize Massey's notion of directed information to take 
into account any time-ordering of the random variables of interest. 

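Definition 4.1 We are given a sequence of stochastic kernels $\{p(da_i \mid a^{i-1})\}_{i=1}^{N}$. Let
$\mathcal{I} = \{i_1, \ldots, i_K\} \subseteq \{1, \ldots, N\}$ where $1 \leq i_1 < i_2 < \cdots < i_K \leq N$. Let $\mathcal{I}^c = \{1, \ldots, N\} \setminus \mathcal{I}$. Let
$A^{\mathcal{I}} = (A_{i_1}, \ldots, A_{i_K})$. Define $A^{\mathcal{I}^c}$ similarly. Then the directed stochastic kernel of $A^{\mathcal{I}}$ with
respect to $A^{\mathcal{I}^c}$ is

$$\vec{p}_{A^{\mathcal{I}} | A^{\mathcal{I}^c}}(da^{\mathcal{I}} \mid a^{\mathcal{I}^c}) = \otimes_{k=1}^{K}\, p(da_{i_k} \mid a^{i_k - 1}).$$

For each $a^{\mathcal{I}^c}$ the directed stochastic kernel $\vec{p}_{A^{\mathcal{I}} | A^{\mathcal{I}^c}}(da^{\mathcal{I}} \mid a^{\mathcal{I}^c})$ is a well defined measure.
For example:

$$\int f(a_1, \ldots, a_N)\, \vec{p}_{A^{\mathcal{I}} | A^{\mathcal{I}^c}}(da^{\mathcal{I}} \mid a^{\mathcal{I}^c})
= \int p(da_{i_1} \mid a^{i_1 - 1}) \int p(da_{i_2} \mid a^{i_2 - 1}) \cdots \int p(da_{i_K} \mid a^{i_K - 1})\, f(a_1, \ldots, a_N)$$

for all bounded functions $f$ measurable with respect to the product $\sigma$-algebra on $\mathcal{A}^N$. Note
that this integral is a measurable function of $a^{\mathcal{I}^c}$.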

One needs to be careful when computing the marginals of a directed stochastic kernel.
For example, if we are given $p(da_1)$, $p(da_2 \mid a_1)$, and $p(da_3 \mid a_1, a_2)$ with the resulting joint
measure $P(da_1, da_2, da_3)$ then with the obvious time ordering:

$$\sum_{a_1 \in \mathcal{A}} \vec{p}(a_1, a_3 \mid a_2) = \sum_{a_1 \in \mathcal{A}} \big( p(a_3 \mid a_1, a_2)\, p(a_1) \big) \neq P(a_3 \mid a_2) \qquad P\text{-almost all } a_1, a_2, a_3$$

unless $A_1 - A_2 - A_3$ forms a Markov chain under $P$. Here $P(a_3 \mid a_2)$ corresponds to the
conditional probability under $P$.
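
The pitfall can be seen numerically. The following sketch uses three binary variables with made-up kernels in which $A_3$ depends on $A_1$ directly, so the Markov chain condition fails. All numbers and function names are illustrative assumptions.

```python
from itertools import product

p1 = {0: 0.5, 1: 0.5}                                   # p(a_1)
p2 = {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.2, 1: 0.8}}   # p(a_2 | a_1)
# p(a_3 | a_1, a_2) depends on a_1, so A_1 - A_2 - A_3 is not a Markov chain.
p3 = {(a1, a2): ({0: 0.7, 1: 0.3} if a1 == 0 else {0: 0.1, 1: 0.9})
      for a1, a2 in product((0, 1), repeat=2)}

def joint(a1, a2, a3):
    """Joint measure P(a_1, a_2, a_3) built from the three kernels."""
    return p1[a1] * p2[(a1,)][a2] * p3[(a1, a2)][a3]

def directed_marginal(a3, a2):
    """sum over a_1 of p(a_3 | a_1, a_2) p(a_1): the a_1-marginal of the
    directed kernel of (A_1, A_3) with respect to A_2."""
    return sum(p3[(a1, a2)][a3] * p1[a1] for a1 in (0, 1))

def conditional(a3, a2):
    """P(a_3 | a_2) computed from the joint measure."""
    num = sum(joint(a1, a2, a3) for a1 in (0, 1))
    den = sum(joint(a1, a2, x) for a1 in (0, 1) for x in (0, 1))
    return num / den

lhs = directed_marginal(0, 0)
rhs = conditional(0, 0)
print(lhs, rhs)
```

The directed marginal weights $a_1$ by its prior $p(a_1)$, whereas the true conditional weights $a_1$ by its posterior given $a_2$, which is why the two values differ here.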

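Definition 4.2 Given a sequence of stochastic kernels $\{p(da_i \mid a^{i-1})\}_{i=1}^{N}$ and $\mathcal{I} \subseteq \{1, \ldots, N\}$,
the directed information is defined as

$$I(A^{\mathcal{I}} \to A^{\mathcal{I}^c}) = D\left( P_{A^{\mathcal{I}}, A^{\mathcal{I}^c}} \,\middle\|\, \vec{P}_{A^{\mathcal{I}} | A^{\mathcal{I}^c}} P_{A^{\mathcal{I}^c}} \right) \tag{3}$$

where $D(\cdot \,\|\, \cdot)$ is the divergence,
$P_{A^{\mathcal{I}}, A^{\mathcal{I}^c}}(da^{\mathcal{I}}, da^{\mathcal{I}^c}) = \vec{p}_{A^{\mathcal{I}^c} | A^{\mathcal{I}}}(da^{\mathcal{I}^c} \mid a^{\mathcal{I}}) \otimes \vec{p}_{A^{\mathcal{I}} | A^{\mathcal{I}^c}}(da^{\mathcal{I}} \mid a^{\mathcal{I}^c})$,
and $\vec{P}_{A^{\mathcal{I}} | A^{\mathcal{I}^c}} P_{A^{\mathcal{I}^c}}(da^{\mathcal{I}}, da^{\mathcal{I}^c}) = \vec{p}_{A^{\mathcal{I}} | A^{\mathcal{I}^c}}(da^{\mathcal{I}} \mid a^{\mathcal{I}^c}) \otimes P_{A^{\mathcal{I}^c}}(da^{\mathcal{I}^c})$
(here $P_{A^{\mathcal{I}^c}}(da^{\mathcal{I}^c})$ is the marginal of $P_{A^{\mathcal{I}}, A^{\mathcal{I}^c}}(da^{\mathcal{I}}, da^{\mathcal{I}^c})$).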

We can recover Massey's definition of directed information [23] by applying definition
4.2 to $A^{\mathcal{I}} = A^T$ and $A^{\mathcal{I}^c} = B^T$ with the time-ordering given in (1):
$I(A^T \to B^T) = \sum_{t=1}^{T} I(A^t; B_t \mid B^{t-1})$. Unlike the chain rule for mutual information the
superscript on $A$ in the summation is "t" and not "T". From definition 4.2 one can easily show:

$$I(A^T \to B^T) = E\left[ \log \frac{\vec{P}_{B^T | A^T}(B^T \mid A^T)}{P_{B^T}(B^T)} \right] = E\left[ \log \frac{P_{A^T | B^T}(A^T \mid B^T)}{\vec{P}_{A^T | B^T}(A^T \mid B^T)} \right]$$

where the stochastic kernel $P_{A^T | B^T}(da^T \mid b^T)$ is a version of the conditional distribution
$P(da^T \mid b^T)$. The second equality shows that the directed information is the ratio between
the posterior distribution and a "causal" prior distribution.



Note that

$$I(A^T; B^T) = E\left[ \log \frac{\vec{P}_{B^T | A^T}(B^T \mid A^T)\, \vec{P}_{A^T | B^T}(A^T \mid B^T)}{P_{B^T}(B^T)\, P_{A^T}(A^T)} \right] = I(A^T \to B^T) + I(B^T \to A^T).$$

By definition 4.2 and time-ordering (1) we have $I(B^T \to A^T) = \sum_{t=1}^{T} I(A_t; B^{t-1} \mid A^{t-1})$.

There is no feedback if and only if $A_t - A^{t-1} - B^{t-1}$ forms a Markov chain under $P$, in which
case $I(B^T \to A^T) = 0$: there is no "information" flowing from the receiver to the transmitter.
Because divergence is nonnegative we can conclude that $I(A^T; B^T) \geq I(A^T \to B^T)$ with
equality if and only if there is no feedback [23], [?].
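
The relation between mutual and directed information can be checked on a small example. The sketch below assumes binary alphabets, horizon $T = 2$, a toy channel, and an encoder that tends to repeat the last channel output; it computes $I(A^T \to B^T)$, $I(B^T \to A^T)$, and $I(A^T; B^T)$ from the expectation formulas above. The kernels and names (`channel`, `encoder`) are hypothetical.

```python
from itertools import product
from math import log2
from collections import defaultdict

def channel(bt, a_hist, b_hist):
    """Toy p(b_t | a^t, b^{t-1}); crossover depends on the previous output."""
    eps = 0.3 if (b_hist and b_hist[-1] == 1) else 0.1
    return 1 - eps if bt == a_hist[-1] else eps

def encoder(at, a_hist, b_hist):
    """Toy p(a_t | a^{t-1}, b^{t-1}): uniform at t = 1, then repeat b_{t-1} w.p. 0.9."""
    if not b_hist:
        return 0.5
    return 0.9 if at == b_hist[-1] else 0.1

T = 2
P, pBA, pAB = {}, {}, {}   # joint measure and the two directed kernels
for a in product((0, 1), repeat=T):
    for b in product((0, 1), repeat=T):
        ch = enc = 1.0
        for t in range(T):
            enc *= encoder(a[t], a[:t], b[:t])        # factor of the directed kernel of A^T given B^T
            ch *= channel(b[t], a[:t + 1], b[:t])     # factor of the directed kernel of B^T given A^T
        pBA[(a, b)], pAB[(a, b)], P[(a, b)] = ch, enc, ch * enc

PA, PB = defaultdict(float), defaultdict(float)
for (a, b), p in P.items():
    PA[a] += p
    PB[b] += p

I_A_to_B = sum(p * log2(pBA[k] / PB[k[1]]) for k, p in P.items())
I_B_to_A = sum(p * log2(pAB[k] / PA[k[0]]) for k, p in P.items())
I_mutual = sum(p * log2(p / (PA[k[0]] * PB[k[1]])) for k, p in P.items())
print(I_A_to_B, I_B_to_A, I_mutual)
```

Because the encoder uses the fed-back output, $I(B^T \to A^T)$ is strictly positive in this example and the mutual information overstates the information sent in the forward direction.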



4.1 Information Density, Directed Information, and Capacity 

When computing the capacity of a channel it will turn out that we will need to know
the convergence properties of the random variables

$$\frac{1}{T} \log \frac{P_{A^T, B^T}(A^T, B^T)}{\vec{P}_{A^T | B^T} P_{B^T}(A^T, B^T)}.$$

This is the normalized information density discussed in [33] suitably generalized to treat feedback. If
there are reasonable regularity properties, like information stability (see below), then these
random variables will converge in probability to a deterministic limit. In the absence of
any such structure we are forced to follow Verdú and Han's lead and define the following
"floor" and "ceiling" limits.

The limsup in probability of a sequence of random variables $\{X_t\}$ is defined as the
smallest extended real number $\alpha$ such that $\forall \epsilon > 0$, $\lim_{t \to \infty} \Pr[X_t > \alpha + \epsilon] = 0$. The liminf
in probability of a sequence of random variables $\{X_t\}$ is defined as the largest extended real
number $\alpha$ such that $\forall \epsilon > 0$, $\lim_{t \to \infty} \Pr[X_t < \alpha - \epsilon] = 0$.



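Let

$$\vec{i}(a^T; b^T) = \log \frac{P_{A^T, B^T}(a^T, b^T)}{\vec{P}_{A^T | B^T} P_{B^T}(a^T, b^T)}.$$

For a sequence of joint measures $\{P_{A^T, B^T}\}_{T=1}^{\infty}$ let

$$\underline{I}(A \to B) = \liminf_{\text{in prob}} \frac{1}{T}\, \vec{i}(A^T; B^T) \quad \text{and} \quad \overline{I}(A \to B) = \limsup_{\text{in prob}} \frac{1}{T}\, \vec{i}(A^T; B^T).$$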


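Lemma 4.1 For any sequence of joint measures $\{P_{A^T, B^T}\}_{T=1}^{\infty}$ we have

$$\underline{I}(A \to B) \leq \liminf_{T \to \infty} \frac{1}{T} I(A^T \to B^T) \leq \limsup_{T \to \infty} \frac{1}{T} I(A^T \to B^T) \leq \overline{I}(A \to B).$$

Proof: See appendix A.2. □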

We now extend Pinsker's [25] notion of information stability. A given sequence of joint
measures $\{P_{A^T, B^T}\}_{T=1}^{\infty}$ is directed information stable if

$$\lim_{T \to \infty} P\left( \left| \frac{\vec{i}(A^T; B^T)}{I(A^T \to B^T)} - 1 \right| > \epsilon \right) = 0 \quad \forall \epsilon > 0.$$

The following lemma shows that directed information stability implies $\frac{1}{T} \vec{i}(A^T; B^T)$
concentrates around its mean $\frac{1}{T} I(A^T \to B^T)$. Note that this mean need not necessarily
converge.



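Lemma 4.2 If the sequence of joint measures $\{P_{A^T, B^T}\}_{T=1}^{\infty}$ is directed information stable
then $\underline{I}(A \to B) = \liminf_{T \to \infty} \frac{1}{T} I(A^T \to B^T) \leq \limsup_{T \to \infty} \frac{1}{T} I(A^T \to B^T) = \overline{I}(A \to B)$.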



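Proof: Directed information stability implies

$$\lim_{T \to \infty} P\left( \left| \frac{1}{T} \vec{i}(A^T; B^T) - \frac{1}{T} I(A^T \to B^T) \right| > \frac{1}{T} I(A^T \to B^T)\, \epsilon \right) = 0 \quad \forall \epsilon > 0.$$

Because $\mathcal{B}$ is finite we know $\frac{1}{T} I(A^T \to B^T) \leq \log |\mathcal{B}|$, hence

$$\lim_{T \to \infty} P\left( \left| \frac{1}{T} \vec{i}(A^T; B^T) - \frac{1}{T} I(A^T \to B^T) \right| > \epsilon \right) = 0 \quad \forall \epsilon > 0.$$

This observation along with lemma 4.1 proves the lemma. □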

To compute the different "information" measures we need to determine the joint measure
$P_{A^T, B^T}(da^T, db^T)$. This can be done if we are given a channel $\{p(db_t \mid a^t, b^{t-1})\}_{t=1}^{T}$ and we
specify a sequence of kernels $\{p(da_t \mid a^{t-1}, b^{t-1})\}_{t=1}^{T}$.



Definition 4.3 A channel input distribution is a sequence of kernels $\{p(da_t \mid a^{t-1}, b^{t-1})\}_{t=1}^{T}$.
A channel input distribution without feedback is a channel input distribution with the
further condition that for each $t$ the kernel $p(da_t \mid a^{t-1}, b^{t-1})$ is independent of $b^{t-1}$.
(Specifically, $p(da_t \mid a^{t-1}, b^{t-1}) = p(da_t \mid a^{t-1}, \tilde{b}^{t-1})$ $\forall\, b^{t-1}, \tilde{b}^{t-1}$.)



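Let $\mathcal{D}_T = \{\{p(da_t \mid a^{t-1}, b^{t-1})\}_{t=1}^{T}\}$ be the set of all channel input distributions. Let
$\mathcal{D}_T^{nfb} \subset \mathcal{D}_T$ be the set of channel input distributions without feedback. We now define the
directed information optimization problems. Fix a channel $\{p(db_t \mid a^t, b^{t-1})\}$. For finite $T$
let

$$C_T = \sup_{\mathcal{D}_T} \frac{1}{T} I(A^T \to B^T) \quad \text{and} \quad C_T^{nfb} = \sup_{\mathcal{D}_T^{nfb}} \frac{1}{T} I(A^T \to B^T) = \sup_{\mathcal{D}_T^{nfb}} \frac{1}{T} I(A^T; B^T).$$

For the infinite horizon case let

$$C = \sup_{\{\mathcal{D}_T\}_{T=1}^{\infty}} \underline{I}(A \to B) \quad \text{and} \quad C^{nfb} = \sup_{\{\mathcal{D}_T^{nfb}\}_{T=1}^{\infty}} \underline{I}(A \to B) = \sup_{\{\mathcal{D}_T^{nfb}\}_{T=1}^{\infty}} \underline{I}(A; B). \tag{4}$$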

Verdú and Han proved the following theorem for the case without feedback [33].

Theorem 4.1 For channels without feedback $C^{o,nfb} = C^{nfb}$.

In a certain sense we already have the solution to the coding problem for channels with 
feedback. Specifically lemma 3.1 tells us that the feedback channel problem is equivalent 
to a new channel coding problem without feedback. This new channel is from $\mathcal{F}^T$ to $\mathcal{B}^T$
and has channel kernels defined by equation (2). Thus we can directly apply theorem 4.1 
to this new channel. 

This can be a very complicated problem to solve. We would have to optimize the mutual 
information over distributions on code functions. The directed information optimization 
problem can often be simpler. One reason is that we can work directly on the original
$\mathcal{A}^T \times \mathcal{B}^T$ space and not on the $\mathcal{F}^T \times \mathcal{B}^T$ space. The second half of this paper describes a
stochastic control approach to solving this optimization. In the next section, though, we 
present the feedback coding theorem. 

5 Coding Theorem for Channels with Feedback 

In this section we prove the following theorem: 
Theorem 5.1 For channels with feedback $C^{o} = C$.

We first give a high-level summary of the issues involved. The converse part is
straightforward. For any channel code and channel we know by lemma 3.1 that there exists a unique
consistent measure $Q(df^T, da^T, db^T)$. From this measure we can compute the induced
channel input distribution $\{q(da_t \mid a^{t-1}, b^{t-1})\}_{t=1}^{T}$. (These stochastic kernels are a version of the
appropriate conditional probabilities.) Now $\{q(da_t \mid a^{t-1}, b^{t-1})\}_{t=1}^{T} \in \mathcal{D}_T$ but it need not
be the supremizing channel input distribution. Thus the directed information under the
induced channel input distribution may be less than the directed information under the
supremizing channel input distribution. This is how we will show $C^{o} \leq C$.

The direct part is the interesting part of theorem 5.1. Here we take the optimizing
channel input distribution $\{p(da_t \mid a^{t-1}, b^{t-1})\}_{t=1}^{T}$ and construct a sequence of code-function
stochastic kernels $\{p(df_t \mid f^{t-1})\}_{t=1}^{T}$. We then prove the direct part of the coding theorem
for the channel from $\mathcal{F}^T$ to $\mathcal{B}^T$ by the usual techniques for channels without feedback. By
a suitable construction of $P_{F^T}$ it can be shown that the induced channel input distribution
equals the original channel input distribution.

In section 5.1 we provide the necessary technical lemmas to characterize the relationship
between code-function distributions and channel input distributions, in section 5.2 we prove 
theorem 5.1, and in section 5.3 we generalize the theorem to more general information 
patterns at the encoder. 

5.1 Main Technical Lemmas 

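We first discuss the channel input distribution induced by a given code-function
distribution. Define $\mathrm{graph}(f_t) = \{(b^{t-1}, a_t) : f_t(b^{t-1}) = a_t\} \subset \mathcal{B}^{t-1} \times \mathcal{A}$. Let
$\Upsilon_t(b^{t-1}, a_t) = \{f_t : (b^{t-1}, a_t) \in \mathrm{graph}(f_t)\}$ and
$\Upsilon^t(b^{t-1}, a^t) = \{f^t : (b^{j-1}, a_j) \in \mathrm{graph}(f_j),\ j = 1, \ldots, t\}$.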

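In lemma 3.1 we showed the channel from $\mathcal{F}^T$ to $\mathcal{B}^T$ depends only on the channel from
$\mathcal{A}^T$ to $\mathcal{B}^T$. Hence for each $t$ and all $b_t$, we have

$$\begin{aligned}
Q(b_t \mid F^t = f^t, B^{t-1} = b^{t-1}) &= p(b_t \mid f^t(b^{t-1}), b^{t-1}) && Q\text{-almost all } f^t, b^{t-1} \\
&= p(b_t \mid \tilde{f}^t(b^{t-1}), b^{t-1}) && \forall\, \tilde{f}^t \in \Upsilon^t(b^{t-1}, f^t(b^{t-1})).
\end{aligned}$$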

We now show that the induced channel input distribution only depends on the sequence
of code-function stochastic kernels $\{p(df_t \mid f^{t-1})\}_{t=1}^{T}$.

Lemma 5.1 We are given a sequence of code-function stochastic kernels $\{p(df_t \mid f^{t-1})\}_{t=1}^{T}$,
a channel $\{p(db_t \mid a^t, b^{t-1})\}_{t=1}^{T}$, and a consistent joint measure $Q(df^T, da^T, db^T)$. Then the
induced channel input distribution is, for each $t$ and all $a_t$, given by

$$Q(a_t \mid a^{t-1}, b^{t-1}) = P_{F^T}\big( \Upsilon_t(b^{t-1}, a_t) \mid \Upsilon^{t-1}(b^{t-2}, a^{t-1}) \big) \tag{5}$$

for $Q$-almost all $a^{t-1}, b^{t-1}$, where $P_{F^T}(df^T) = \otimes_{t=1}^{T} p(df_t \mid f^{t-1})$.

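Proof: Note $P_{F^T}(\Upsilon^{t-1}(b^{t-2}, a^{t-1})) = Q(\Upsilon^{t-1}(b^{t-2}, a^{t-1})) \geq Q(\Upsilon^{t-1}(b^{t-2}, a^{t-1}), a^{t-1}, b^{t-1})
= Q(a^{t-1}, b^{t-1})$. Thus $Q(a^{t-1}, b^{t-1}) > 0$ implies $P_{F^T}(\Upsilon^{t-1}(b^{t-2}, a^{t-1})) > 0$. Hence the right
hand side of equation (5) exists $Q$-almost surely.

We now prove the correctness of equation (5). For each $t$ and $(a^{t-1}, b^{t-1})$ such that
$Q(a^{t-1}, b^{t-1}) > 0$ we have

$$\begin{aligned}
Q(a^t, b^{t-1}) &\overset{(a)}{=} \sum_{f^t \in \Upsilon^t(b^{t-1}, a^t)} \left( \prod_{i=1}^{t} p(f_i \mid f^{i-1}) \right) \vec{p}(b^{t-1} \mid a^{t-1}) \\
&\overset{(b)}{=} P_{F^T}\big( \Upsilon_t(b^{t-1}, a_t) \mid \Upsilon^{t-1}(b^{t-2}, a^{t-1}) \big)\, P_{F^T}\big( \Upsilon^{t-1}(b^{t-2}, a^{t-1}) \big)\, \vec{p}(b^{t-1} \mid a^{t-1}) \\
&= P_{F^T}\big( \Upsilon_t(b^{t-1}, a_t) \mid \Upsilon^{t-1}(b^{t-2}, a^{t-1}) \big) \sum_{f^{t-1}} \left( \prod_{i=1}^{t-1} p(f_i \mid f^{i-1})\, \delta_{\{f_i(b^{i-1})\}}(a_i) \right) \vec{p}(b^{t-1} \mid a^{t-1}) \\
&= P_{F^T}\big( \Upsilon_t(b^{t-1}, a_t) \mid \Upsilon^{t-1}(b^{t-2}, a^{t-1}) \big)\, Q(a^{t-1}, b^{t-1})
\end{aligned}$$

where (a) follows because $\vec{p}(b^{t-1} \mid a^{t-1})$ does not depend on $f^{t-1}$ and the delta functions
$\{\delta_{\{f_i(b^{i-1})\}}(a_i)\}$ restrict the sum over $f^{t-1}$. Line (b) follows because $Q(a^{t-1}, b^{t-1}) > 0$ and
hence the conditional probability exists. □

The almost sure qualifier in equation (5) comes from the fact that $Q(a^{t-1}, b^{t-1})$ may equal
zero for some $a^{t-1}, b^{t-1}$. This can happen, for example, if $P_{F^T}(df^T)$ puts zero mass on
those $f^{t-1}$ that produce $a^{t-1}$ from $b^{t-2}$ or if $b^{t-1}$ has zero probability of appearing under
the channel $\{p(db_t \mid a^t, b^{t-1})\}$.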
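
Equation (5) can be checked numerically at $t = 2$ for binary alphabets. In the sketch below a code-function $f^2$ is stored as a pair `(a1, f2)`, where `f2[b1]` is the input sent after feedback `b1`. The channel, the weights defining $P_{F^T}$, and the helper names `induced` and `graph_ratio` are hypothetical choices for illustration.

```python
from itertools import product
from collections import defaultdict

A = B = (0, 1)

def channel(bt, a_hist, b_hist):
    """Toy p(b_t | a^t, b^{t-1}); crossover depends on the previous output."""
    eps = 0.3 if (b_hist and b_hist[-1] == 1) else 0.1
    return 1 - eps if bt == a_hist[-1] else eps

# A code-function f^2 = (f_1, f_2) is stored as (a_1, (f_2(0), f_2(1))).
codes = [(a1, f2) for a1 in A for f2 in product(A, repeat=2)]
weights = [1, 2, 3, 4, 4, 3, 2, 1]                  # an arbitrary distribution on code-functions
P_F = {c: w / sum(weights) for c, w in zip(codes, weights)}

# Marginal Q(a_1, a_2, b_1) of the consistent measure.
Q_ab = defaultdict(float)
for (a1, f2), pf in P_F.items():
    for b1 in B:
        a2 = f2[b1]                                 # A_2 = F_2(B_1)
        Q_ab[(a1, a2, b1)] += pf * channel(b1, (a1,), ())

def induced(a2, a1, b1):
    """Left side of (5): Q(a_2 | a_1, b_1)."""
    return Q_ab[(a1, a2, b1)] / sum(Q_ab[(a1, x, b1)] for x in A)

def graph_ratio(a2, a1, b1):
    """Right side of (5): P_F(f_2(b_1) = a_2 | f_1 = a_1)."""
    num = sum(p for (g1, g2), p in P_F.items() if g1 == a1 and g2[b1] == a2)
    den = sum(p for (g1, g2), p in P_F.items() if g1 == a1)
    return num / den

max_gap = max(abs(induced(a2, a1, b1) - graph_ratio(a2, a1, b1))
              for a1, a2, b1 in product(A, A, B))
print(max_gap)
```

The channel factor $p(b_1 \mid a_1)$ cancels in the ratio, which is the numerical counterpart of the statement that the induced channel input distribution depends only on the code-function kernels.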

We now show the equivalence of the directed information measures for both the "$\mathcal{F}^T - \mathcal{B}^T$"
and the "$\mathcal{A}^T - \mathcal{B}^T$" channels.

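Lemma 5.2 For each finite $T$ and every consistent joint measure $Q(df^T, da^T, db^T)$ we have

$$\frac{Q_{F^T, B^T}(F^T, B^T)}{Q_{F^T} Q_{B^T}(F^T, B^T)} = \frac{Q_{A^T, B^T}(A^T, B^T)}{\vec{Q}_{A^T | B^T} Q_{B^T}(A^T, B^T)} \qquad Q\text{-a.s.} \tag{6}$$

hence $I(F^T; B^T) = I(A^T \to B^T)$. Furthermore, if given a sequence of consistent measures
$\{Q(df^T, da^T, db^T)\}_{T=1}^{\infty}$, then $\underline{I}(F; B) = \underline{I}(A \to B)$.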

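Proof: Fix $T$ finite. Then for every $(f^T, b^T)$ such that $Q(f^T, b^T) > 0$ we have, with $a^T = f^T(b^{T-1})$,

$$\begin{aligned}
\frac{Q_{F^T, B^T}(f^T, b^T)}{Q_{F^T} Q_{B^T}(f^T, b^T)}
&= \frac{\sum_{\tilde{a}^T} Q(f^T, \tilde{a}^T, b^T)}{Q_{B^T}(b^T)\, Q_{F^T}(f^T)} \\
&= \frac{\sum_{\tilde{a}^T} \prod_{t=1}^{T} p(f_t \mid f^{t-1})\, \delta_{\{f_t(b^{t-1})\}}(\tilde{a}_t)\, p(b_t \mid \tilde{a}^t, b^{t-1})}{Q_{B^T}(b^T)\, Q_{F^T}(f^T)} \\
&= \frac{P_{F^T}(f^T)\, \vec{P}_{B^T | A^T}(b^T \mid a^T)}{Q_{B^T}(b^T)\, Q_{F^T}(f^T)} \\
&\overset{(a)}{=} \frac{\vec{P}_{B^T | A^T}(b^T \mid a^T)\, P_{F^T}(\Upsilon^T(b^{T-1}, a^T))}{Q_{B^T}(b^T)\, \vec{q}_{A^T | B^T}(a^T \mid b^T)} \\
&= \frac{\vec{P}_{B^T | A^T}(b^T \mid a^T) \sum_{\tilde{f}^T} \prod_{t=1}^{T} p(\tilde{f}_t \mid \tilde{f}^{t-1})\, \delta_{\{\tilde{f}_t(b^{t-1})\}}(a_t)}{Q_{B^T}(b^T)\, \vec{q}_{A^T | B^T}(a^T \mid b^T)} \\
&= \frac{\sum_{\tilde{f}^T} \prod_{t=1}^{T} p(\tilde{f}_t \mid \tilde{f}^{t-1})\, \delta_{\{\tilde{f}_t(b^{t-1})\}}(a_t)\, p(b_t \mid a^t, b^{t-1})}{Q_{B^T}(b^T)\, \vec{q}_{A^T | B^T}(a^T \mid b^T)} \\
&= \frac{\sum_{\tilde{f}^T} Q(\tilde{f}^T, a^T, b^T)}{Q_{B^T}(b^T)\, \vec{q}_{A^T | B^T}(a^T \mid b^T)} \\
&= \frac{Q_{A^T, B^T}(a^T, b^T)}{\vec{Q}_{A^T | B^T} Q_{B^T}(a^T, b^T)}
\end{aligned}$$

where (a) follows because the $Q$ marginal $Q(df^T) = P_{F^T}(df^T)$ and for $Q(f^T, a^T, b^T) > 0$
lemma 5.1 shows $P_{F^T}(\Upsilon^T(b^{T-1}, a^T)) = \vec{q}_{A^T | B^T}(a^T \mid b^T)$.

Furthermore, if given a sequence of consistent measures $\{Q(df^T, da^T, db^T)\}_{T=1}^{\infty}$, equation
(6) states that for each $T$ the random variables on the left hand side and right hand side
are almost surely equal. Hence $\underline{I}(F; B) = \underline{I}(A \to B)$. □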

We have shown how a code-function distribution induces a channel input distribution.
As we discussed in the introduction to this section, we would like to choose a channel input
distribution, $\{p(da_t \mid a^{t-1}, b^{t-1})\}_{t=1}^{T}$, and construct a sequence of code-function stochastic
kernels, $\{p(df_t \mid f^{t-1})\}_{t=1}^{T}$, such that the resulting induced channel input distribution,
$\{q(da_t \mid a^{t-1}, b^{t-1})\}_{t=1}^{T}$, equals the chosen channel input distribution. This is shown
pictorially:

$$\{p(da_t \mid a^{t-1}, b^{t-1})\}_{t=1}^{T} \;\longrightarrow\; \{p(df_t \mid f^{t-1})\}_{t=1}^{T} \;\longrightarrow\; \{q(da_t \mid a^{t-1}, b^{t-1})\}_{t=1}^{T}.$$



The first arrow represents the construction of the code-function distribution from the cho- 
sen channel input distribution. The second arrow is described by the result in lemma 5.1. 
Lemma 5.2 states that I^q{F;B) = Lgi^ ~^ B). Let P correspond to the joint measure 
determined by the left channel input distribution in the diagram. If we can find condi- 
tions such that the induced channel input distribution equals the chosen channel input 
distribution then Iq{A B) = Ip{A B). Consequently /q(F; B) = Ip{A B). 

Definition 5.1 We call a sequence of code-function stochastic kernels $\{p(df_t \mid f^{t-1})\}_{t=1}^{T}$, with resulting joint measure $P_{F^T}(df^T)$, good with respect to the channel input distribution $\{p(da_t \mid a^{t-1}, b^{t-1})\}_{t=1}^{T}$ if for each $t$ and all $a^t, b^{t-1}$ we have

$$P_{F^T}(\Upsilon^t(b^{t-1},a^t)) = \vec{p}(a^t \mid b^{t-1}), \qquad \text{where } \vec{p}(a^t \mid b^{t-1}) = \prod_{i=1}^{t} p(a_i \mid a^{i-1}, b^{i-1}).$$

Lemma 5.4 below shows that good code-function distributions exist. Before proving that we show the equivalence of the chosen and induced channel input distributions when a good code-function distribution is used.

Lemma 5.3 We are given a sequence of code-function stochastic kernels $\{p(df_t \mid f^{t-1})\}_{t=1}^{T}$, a channel $\{p(db_t \mid a^t, b^{t-1})\}_{t=1}^{T}$, and a consistent joint measure $Q(df^T, da^T, db^T)$. We are also given a channel input distribution $\{r(da_t \mid a^{t-1}, b^{t-1})\}_{t=1}^{T}$. The induced channel input distribution satisfies for each $t$ and all $a_t$

$$Q(a_t \mid a^{t-1}, b^{t-1}) = r(a_t \mid a^{t-1}, b^{t-1}) \quad \text{for } Q \text{ almost all } a^{t-1}, b^{t-1} \tag{7}$$

if and only if the sequence of code-function stochastic kernels $\{p(df_t \mid f^{t-1})\}_{t=1}^{T}$ is good with respect to $\{r(da_t \mid a^{t-1}, b^{t-1})\}_{t=1}^{T}$.

Proof: Note that for each $t$ and all $a_t$:

$$\begin{aligned}
Q(a_t \mid a^{t-1}, b^{t-1}) &\stackrel{(a)}{=} P_{F^T}\big(\Upsilon_t(b^{t-1},a_t) \mid \Upsilon^{t-1}(b^{t-2},a^{t-1})\big) && Q\text{-almost all } a^{t-1}, b^{t-1} \\
&\stackrel{(b)}{=} \frac{\vec{r}(a^t \mid b^{t-1})}{\vec{r}(a^{t-1} \mid b^{t-2})} && Q\text{-almost all } a^{t-1}, b^{t-1} \\
&= r(a_t \mid a^{t-1}, b^{t-1}) && Q\text{-almost all } a^{t-1}, b^{t-1}
\end{aligned}$$

where (a) follows from lemma 5.1 and (b) follows from definition 5.1. □

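Lemma 5.4 For any channel input distribution $\{p(da_t \mid a^{t-1}, b^{t-1})\}_{t=1}^{T}$ there exists a sequence of code-function stochastic kernels that is good with respect to it.

Proof: For all $t$ define $p(f_t \mid f^{t-1})$ as follows

$$p(f_t \mid f^{t-1}) = \prod_{b^{t-1}} p\big(f_t(b^{t-1}) \mid f^{t-1}(b^{t-2}), b^{t-1}\big) \tag{8}$$

We first show that $p(f_t \mid f^{t-1})$ defined in equation (8) is a stochastic kernel. Note that for each $t$ and all $f^{t-1}$ we have, singling out one fixed $\tilde b^{t-1}$,

$$\begin{aligned}
\sum_{f_t} p(f_t \mid f^{t-1}) &= \sum_{f_t}\prod_{b^{t-1}} p\big(f_t(b^{t-1}) \mid f^{t-1}(b^{t-2}), b^{t-1}\big) \\
&= \sum_{a_t} p\big(a_t \mid f^{t-1}(\tilde b^{t-2}), \tilde b^{t-1}\big) \sum_{f_t :\, f_t(\tilde b^{t-1}) = a_t}\ \prod_{b^{t-1} \neq \tilde b^{t-1}} p\big(f_t(b^{t-1}) \mid f^{t-1}(b^{t-2}), b^{t-1}\big) \\
&\stackrel{(a)}{=} \prod_{b^{t-1}} \sum_{a_t} p\big(a_t \mid f^{t-1}(b^{t-2}), b^{t-1}\big) \\
&= 1
\end{aligned}$$

where (a) follows by repeating the previous step for each $b^{t-1}$. In short, the sum is over all functions $f_t : \mathcal{B}^{t-1} \to \mathcal{A}$. Hence the sum over $f_t$ can be viewed as a sum over all assignments of $a_t$'s to each choice of $b^{t-1}$. Then the sum of products can be written as a product of sums.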

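We now show by induction that for each $t$ and all $a^t, b^{t-1}$ we have $P_{F^T}(\Upsilon^t(b^{t-1},a^t)) = \vec{p}(a^t \mid b^{t-1})$. For $t = 1$ we have $P_{F^T}(\Upsilon^1(a_1)) = \sum_{f_1 \in \Upsilon_1(a_1)} p(f_1) = p(a_1)$. For $t + 1$ we have

$$\begin{aligned}
P_{F^T}(\Upsilon^{t+1}(b^t,a^{t+1}))
&= \sum_{f^{t+1} \in \Upsilon^{t+1}(b^t,a^{t+1})}\ \prod_{i=1}^{t+1} p(f_i \mid f^{i-1}) \\
&= \sum_{f^t \in \Upsilon^t(b^{t-1},a^t)}\ \prod_{i=1}^{t} p(f_i \mid f^{i-1}) \sum_{f_{t+1} \in \Upsilon_{t+1}(b^t,a_{t+1})} p(f_{t+1} \mid f^t) \\
&\stackrel{(a)}{=} \sum_{f^t \in \Upsilon^t(b^{t-1},a^t)}\ \prod_{i=1}^{t} p(f_i \mid f^{i-1})\; p(a_{t+1} \mid a^t, b^t) \sum_{f_{t+1} \in \Upsilon_{t+1}(b^t,a_{t+1})}\ \prod_{\tilde b^t \neq b^t} p\big(f_{t+1}(\tilde b^t) \mid f^t(\tilde b^{t-1}), \tilde b^t\big) \\
&\stackrel{(b)}{=} P_{F^T}(\Upsilon^t(b^{t-1},a^t))\; p(a_{t+1} \mid a^t, b^t) \\
&\stackrel{(c)}{=} \vec{p}(a^t \mid b^{t-1})\; p(a_{t+1} \mid a^t, b^t) \\
&= \vec{p}(a^{t+1} \mid b^t)
\end{aligned}$$

where (a) follows because $f^{t+1} \in \Upsilon^{t+1}(b^t, a^{t+1})$, (b) follows from an argument similar to that given above, and (c) follows from the induction hypothesis. □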
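The construction in equation (8) can be made concrete for $T = 2$. The sketch below is illustrative only: the binary alphabets and the numerical channel input distribution `p1`, `p2` are hypothetical, as are the function names. It builds the code-function distribution from equation (8) and lets one verify the "good" property of definition 5.1, namely $P_{F^T}(\Upsilon^2(b_1,a^2)) = p(a_1)\,p(a_2 \mid a_1, b_1)$.

```python
from itertools import product

A = (0, 1)  # channel input alphabet (illustrative)
B = (0, 1)  # channel output alphabet (illustrative)

# Hypothetical channel input distribution for T = 2:
# p1[a1] = p(a1), p2[(a1, b1)][a2] = p(a2 | a1, b1).
p1 = {0: 0.3, 1: 0.7}
p2 = {
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.2, 1: 0.8},
    (1, 0): {0: 0.5, 1: 0.5},
    (1, 1): {0: 0.6, 1: 0.4},
}

def good_code_function_distribution():
    """Return P(f1, f2) built from equation (8) with T = 2.

    f1 is a constant (an element of A); f2 is a map B -> A, stored as a
    tuple indexed by b1.
    """
    dist = {}
    for f1 in A:
        for f2 in product(A, repeat=len(B)):
            prob = p1[f1]
            for b1 in B:  # product over all b1, as in equation (8)
                prob *= p2[(f1, b1)][f2[b1]]
            dist[(f1, f2)] = prob
    return dist

def mass_of_upsilon(dist, b1, a1, a2):
    """Mass of the code-functions with f1 = a1 and f2(b1) = a2."""
    return sum(p for (f1, f2), p in dist.items() if f1 == a1 and f2[b1] == a2)

dist = good_code_function_distribution()
```

The independence across different $b_1$ enforced by equation (8) is what makes the sum over the unconstrained values of $f_2$ collapse to one.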

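A function $f_t$ is defined by its graph. In the above construction, equation (8), we have enforced independence across the different $b^{t-1}$. Specifically for $b^{t-1} \neq \tilde b^{t-1}$ we have

$$P_{F^T}\big(\Upsilon_t(b^{t-1},a_t) \cap \Upsilon_t(\tilde b^{t-1},\tilde a_t) \mid f^{t-1}\big) = P_{F^T}\big(\Upsilon_t(b^{t-1},a_t) \mid f^{t-1}\big) \times P_{F^T}\big(\Upsilon_t(\tilde b^{t-1},\tilde a_t) \mid f^{t-1}\big).$$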

We do not need to assume this independence. For example it is known that Gaussian 
(linear) channel input distributions are optimal for Gaussian channels. For more details 
see |2ni; |4()j . |41j . When dealing with more complicated alphabets one may want the 
functions ft to be continuous with respect to the topologies of A and B. Continuity is 
trivially satisfied in the finite alphabet case. 

Note that it is possible for distinct code-function stochastic kernels to induce the same channel input distribution (almost surely). In addition, there may be many code-function stochastic kernels that are good with respect to a given channel input distribution. As an example consider the case when the channel input distribution does not depend on the channel output: $\{p(da_t \mid a^{t-1})\}_{t=1}^{T}$. One choice of $P_{F^T}$ is given in equation (8):

$$p(f_t \mid f^{t-1}) = \prod_{b^{t-1}} p\big(f_t(b^{t-1}) \mid f^{t-1}(b^{t-2})\big)$$

Another choice would be to put zero mass on code-functions that depend on feedback (i.e. only use codewords):

$$P_{F^T}(f^T) = \begin{cases} \prod_{t=1}^{T} p(a_t \mid a^{t-1}) & \text{if } f_t(b^{t-1}) = a_t \text{ for all } t \text{ and all } b^{t-1}, \\ 0 & \text{otherwise.} \end{cases}$$

One can show that this $P_{F^T}(df^T)$ is good with respect to $\{p(da_t \mid a^{t-1})\}$ by checking for each $t$: $P_{F^T}(\Upsilon^t(b^{t-1},a^t)) = \prod_{i=1}^{t} p(a_i \mid a^{i-1})$.

For memoryless channels we know the optimal channel input distribution is $\{p(da_t)\}_{t=1}^{T}$. Feedback in this case cannot increase capacity but that does not preclude us from using feedback. For example, feedback is known to decrease latency.

5.2 Feedback Channel Coding Theorem 

Now we can prove the feedback channel coding theorem 5.1. We start with the converse part and then prove the direct part.

Converse Theorem: Choose a $(T, M, \epsilon)$ channel code $\{f^T[w]\}_{w=1}^{M}$. Place a prior probability $\frac{1}{M}$ on each code-function $f^T[w]$. By lemma 3.1 and corollary 3.1 this defines consistent measures $Q(df^T, da^T, db^T)$ and $Q(dw, da^T, db^T)$. The following is a generalization of the Verdú-Han converse [33].

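Lemma 5.5 Every $(T, M, \epsilon)$ channel code satisfies, for every $\gamma > 0$,

$$\epsilon \ge Q\left(\frac{1}{T}\log\frac{Q_{A^T,B^T}(A^T,B^T)}{\vec{Q}_{A^T|B^T}Q_{B^T}(A^T,B^T)} \le \frac{1}{T}\log M - \gamma\right) - 2^{-\gamma T}.$$

Proof: Choose a $\gamma > 0$. Let $D_w \subset \mathcal{B}^T$ be the decoding region for message $w$. The only restriction we place on the decoding regions is that they do not intersect: $D_w \cap D_{\tilde w} = \emptyset$ for all $w \neq \tilde w$. (This is always true when using a channel decoder: $D_w = \{b^T : g(b^T) = w\}$.)

Under this restriction on the decoder Verdú and Han show in theorem 4 of [33] that any $(T, M, \epsilon)$ channel code for the channel $\{p(db_t \mid f^t, b^{t-1})\}$ without feedback (see equation (2)) satisfies

$$\epsilon \ge Q\left(\frac{1}{T}\log\frac{Q_{F^T,B^T}(F^T,B^T)}{Q_{F^T}Q_{B^T}(F^T,B^T)} \le \frac{1}{T}\log M - \gamma\right) - 2^{-\gamma T}.$$

By lemma 5.2 we know that $\frac{Q_{F^T,B^T}(F^T,B^T)}{Q_{F^T}Q_{B^T}(F^T,B^T)} = \frac{Q_{A^T,B^T}(A^T,B^T)}{\vec{Q}_{A^T|B^T}Q_{B^T}(A^T,B^T)}$ holds $Q$-a.s. □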

Note that in the proof of lemma 5.5 the only property of the decoder we used is the 
restriction that the decoding regions not overlap. Thus the lemma holds independently of 
the decoder that one uses. 

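Theorem 5.2 The channel capacity $C^{O} \le C$.

Proof: Assume towards a contradiction that $C^{O} > C$. Specifically, assume there exists a sequence of $(T, M_T, \epsilon_T)$ channel codes with $\lim_{T \to \infty} \epsilon_T = 0$ and $\liminf_{T \to \infty} \frac{1}{T}\log M_T > C + 2\gamma$ for some $\gamma > 0$. Then

$$\begin{aligned}
\epsilon_T &\ge Q\left(\frac{1}{T}\log\frac{Q_{A^T,B^T}(A^T,B^T)}{\vec{Q}_{A^T|B^T}Q_{B^T}(A^T,B^T)} \le \frac{1}{T}\log M_T - \gamma\right) - 2^{-\gamma T} \\
&\ge Q\left(\frac{1}{T}\log\frac{Q_{A^T,B^T}(A^T,B^T)}{\vec{Q}_{A^T|B^T}Q_{B^T}(A^T,B^T)} \le C + \gamma\right) - 2^{-\gamma T}
\end{aligned}$$

where the first line follows from lemma 5.5 and the second line holds for all sufficiently large $T$. By the definition of $C$ and for all sufficiently large $T$ the mass below $C + \gamma$ has nonzero probability. Therefore the right hand side in the last inequality is greater than zero, contradicting $\epsilon_T \to 0$. □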

Direct Theorem: We will prove the direct theorem via a random coding argument. The following is a generalization of Feinstein's lemma [33].

Lemma 5.6 Fix a time $T$, an $0 < \epsilon < 1$, and a channel $\{p(db_t \mid b^{t-1}, a^t)\}_{t=1}^{T}$. Then for all $\gamma > 0$ and any channel input distribution $\{r(da_t \mid a^{t-1}, b^{t-1})\}_{t=1}^{T}$ there exists a $(T, M, \epsilon)$ channel code that satisfies

$$\epsilon \le R_{A^T,B^T}\left(\frac{1}{T}\log\frac{R_{A^T,B^T}(A^T,B^T)}{\vec{R}_{A^T|B^T}R_{B^T}(A^T,B^T)} \le \frac{1}{T}\log M + \gamma\right) + 2^{-\gamma T}$$

where $R_{A^T,B^T}(da^T, db^T) = \bigotimes_{t=1}^{T} p(db_t \mid b^{t-1}, a^t) \otimes r(da_t \mid a^{t-1}, b^{t-1})$.

Proof: Let $\{p(df_t \mid f^{t-1})\}_{t=1}^{T}$ be any sequence of code-function stochastic kernels good with respect to the channel input distribution $\{r(da_t \mid a^{t-1}, b^{t-1})\}_{t=1}^{T}$. Let $Q(df^T, da^T, db^T)$ be the consistent joint measure determined by this $\{p(df_t \mid f^{t-1})\}_{t=1}^{T}$ and the channel.

Verdú and Han (theorem 2 of [33]) show that for the channel $\{p(db_t \mid f^t, b^{t-1})\}_{t=1}^{T}$ without feedback and for every $\gamma > 0$ there exists a channel code $(T, M, \epsilon)$ that satisfies:

$$\epsilon \le Q\left(\frac{1}{T}\log\frac{Q_{F^T,B^T}(F^T,B^T)}{Q_{F^T}Q_{B^T}(F^T,B^T)} \le \frac{1}{T}\log M + \gamma\right) + 2^{-\gamma T}.$$

Lemma 5.2 shows $\frac{Q_{F^T,B^T}(F^T,B^T)}{Q_{F^T}Q_{B^T}(F^T,B^T)} = \frac{Q_{A^T,B^T}(A^T,B^T)}{\vec{Q}_{A^T|B^T}Q_{B^T}(A^T,B^T)}$ holds $Q$-almost surely. Lemma 5.3 shows $Q(a_t \mid a^{t-1}, b^{t-1}) = r(a_t \mid a^{t-1}, b^{t-1})$ $Q$-almost surely. Hence $Q(da^T, db^T) = R_{A^T,B^T}(da^T, db^T)$. □

Recall that the random coding argument underlying this result requires a distribution 
on channel codes given by randomly drawing M code-functions uniformly from Q{df'^). 

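Theorem 5.3 The channel capacity $C^{O} \ge C$.

Proof: We follow [33]. Fix an $\epsilon > 0$. We will show that $C$ is an $\epsilon$-achievable rate by demonstrating that for every $\delta > 0$ and all sufficiently large $T$ there exists a sequence of $(T, M, 2^{-\delta T/4} + \frac{\epsilon}{2})$ codes with rate $C - \delta < \frac{\log M}{T} < C - \frac{\delta}{2}$. If in the previous lemma we choose $\gamma = \frac{\delta}{4}$, then we get

$$\begin{aligned}
R_{A^T,B^T}\left(\frac{1}{T}\log\frac{R_{A^T,B^T}(A^T,B^T)}{\vec{R}_{A^T|B^T}R_{B^T}(A^T,B^T)} \le \frac{1}{T}\log M + \frac{\delta}{4}\right)
&\le R_{A^T,B^T}\left(\frac{1}{T}\log\frac{R_{A^T,B^T}(A^T,B^T)}{\vec{R}_{A^T|B^T}R_{B^T}(A^T,B^T)} \le C - \frac{\delta}{4}\right) \\
&\le \frac{\epsilon}{2}
\end{aligned}$$

where the second inequality holds for all sufficiently large $T$. To see this note that by the definition of $C$ and $T$ large enough the mass below $C - \frac{\delta}{4}$ has probability zero. □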

By combining theorems 5.2 and 5.3 we can conclude theorem 5.1. Specifically $C$ is the feedback channel capacity. It should be clear that if we restrict ourselves to channels without feedback then we recover the original coding theorem by Verdú and Han [33].

Definition 5.2 A channel with capacity $C^{O}$ has a strong converse if for all $\delta > 0$ every sequence of channel codes $\{(T, M_T, \epsilon_T)\}$ with $\liminf_{T \to \infty} \frac{\log M_T}{T} > C^{O} + \delta$ satisfies $\lim_{T \to \infty} \epsilon_T = 1$.

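Following theorem 7 of [33] we have:

Proposition 5.1 A channel has a strong converse if and only if $\sup_{\{\mathcal{D}_T\}_{T=1}^{\infty}} \underline{I}(A \to B) = \sup_{\{\mathcal{D}_T\}_{T=1}^{\infty}} \overline{I}(A \to B)$ and hence $C = \lim_{T \to \infty} \sup_{\mathcal{D}_T} \frac{1}{T} I(A^T \to B^T)$.

The latter part follows from theorem 5.1, lemma 4.1, and the finiteness of $\mathcal{B}$.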



Error Exponents: We can generalize Gallager's [14] error exponent to feedback channels. Specifically, the error exponent for rate $R$ and blocklength $T$ is given by



sup max 



For full details see



-pR 



-InV

5.3 General Information Pattern

So far we have assumed that the encoder has access to all the channel outputs $B^{t-1}$. There are many situations, though, where the information pattern [35] at the encoder may be restricted. Let $\mathcal{E}$ be a finite set and let $E_t = \psi_t(B^t)$. Here the measurable functions $\psi_t : \mathcal{B}^t \to \mathcal{E}$ determine the information fed back from the decoder to the encoder. Let $\Psi^T = \{\psi_t\}_{t=1}^{T}$. In the case of $\Delta$-delayed feedback we have $E_t = \psi_t(B^t) = B_{t-\Delta+1}$. If $\Delta = 1$ then $E_t = \psi_t(B^t) = B_t$ and we are in the situation discussed above. Quantized channel output feedback can be handled by letting the $\{\psi_t\}$ be quantizers. The time ordering is $A_1, B_1, E_1, A_2, B_2, E_2, \ldots, A_T, B_T, E_T$.
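As a small illustration of the feedback maps $\psi_t$ described above, the following sketch implements a $\Delta$-delayed map and a simple threshold quantizer. The function names, the use of integers to label channel outputs, and the convention of returning `None` before enough outputs are available are assumptions of this sketch, not part of the formulation.

```python
def delayed_feedback(outputs, delay):
    """psi_t for Delta-delayed feedback: given b^t = outputs, return
    b_{t - delay + 1}, or None while fewer than `delay` outputs exist
    (the None convention is an assumption of this sketch)."""
    if delay < 1:
        raise ValueError("delay must be at least 1")
    t = len(outputs)
    if t < delay:
        return None
    return outputs[t - delay]  # 0-based position of b_{t-delay+1}

def quantized_feedback(outputs, thresholds):
    """psi_t for quantized feedback: map the latest output to the index
    of the first threshold it does not exceed (thresholds are sorted)."""
    latest = outputs[-1]
    for index, threshold in enumerate(thresholds):
        if latest <= threshold:
            return index
    return len(thresholds)
```

With `delay=1` the first map returns the most recent output, which recovers the unrestricted feedback case.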

A channel code-function with information pattern $\Psi^T$ is a sequence of $T$ deterministic measurable maps $\{f_t\}_{t=1}^{T}$ such that $f_t : \mathcal{E}^{t-1} \to \mathcal{A}$ taking $e^{t-1} \mapsto a_t$. Denote the set of all code-functions with restricted information pattern by $\mathcal{F}^{T,\Psi} \subset \mathcal{F}^T$. The operational capacity with information pattern $\Psi$, denoted by $C^{O,\Psi}$, is defined similarly to definition 3.2.

Just as in section 3.1 we can define a joint measure $P(df^T, da^T, db^T, de^T)$ as the interconnection of the code-functions and the channel $\{p(db_t \mid a^t, b^{t-1})\}$. Lemma 3.1 follows as before except that now condition two of consistency requires both $A_t = F_t(E^{t-1})$ and $E_t = \psi_t(B^t)$ $Q$-a.s.

Define the channel input distribution with information pattern $\Psi$ to be a sequence of stochastic kernels $\{p(da_t \mid a^{t-1}, b^{t-1})\}$ with the further condition that for each $t$ the kernel $p(da_t \mid a^{t-1}, b^{t-1}) = p(da_t \mid a^{t-1}, \psi^{t-1}(b^{t-1}))$. Let $\mathcal{D}_T^{\Psi} = \{\{p(da_t \mid a^{t-1}, b^{t-1})\}_{t=1}^{T}\}$ be the set of all channel input distributions with information pattern $\Psi$. Let

$$C_T^{\Psi} = \sup_{\mathcal{D}_T^{\Psi}} \frac{1}{T} I(A^T \to B^T) \ \text{ for finite } T \quad \text{and} \quad C^{\Psi} = \sup_{\{\mathcal{D}_T^{\Psi}\}_{T=1}^{\infty}} \underline{I}(A \to B).$$

For the general information pattern, lemmas 5.1-5.4 and theorems 5.1-5.6 continue to hold with obvious modifications. Hence

Theorem 5.4 For channels with information pattern $\Psi$ we have $C^{O,\Psi} = C^{\Psi}$.

Figure 2: Markov Channel

This result holds because the feedback is a causal, deterministic function of the channel 
outputs. It would be more interesting and practical if the feedback were noisy. This is 
a more complicated problem as it is related to the problem of channel coding with side- 
information at the encoder. 



6 Markov Channels 

In this section we formulate the Markov channel feedback capacity problem. As before let $\mathcal{A}, \mathcal{B}$ be spaces with a finite number of elements representing the channel input and channel output, respectively. Furthermore let $\mathcal{S}$ be a state space with a finite number of elements with the counting $\sigma$-algebra. Let $S_t, A_t, B_t$ be measurable random elements taking values in $\mathcal{S}, \mathcal{A}, \mathcal{B}$ respectively. See figure 2. There is a natural time-ordering on the random variables of interest:

$$W, S_1, A_1, B_1, S_2, \ldots, S_t, \underbrace{A_t, B_t, S_{t+1}}_{t\text{-th epoch}}, \underbrace{A_{t+1}, B_{t+1}, S_{t+2}}_{(t+1)\text{-st epoch}}, \ldots, S_T, A_T, B_T, \hat{W} \tag{9}$$

First, at time 0 a message $W$ is produced and the initial state $S_1$ is drawn. The order of events in each of the $T$ epochs is described in (9). At the beginning of the $t$-th epoch the channel input symbol $A_t$ is placed on the channel by the transmitter, then $B_t$ is observed by the receiver, then the state of the system evolves to $S_{t+1}$, and then finally the receiver feeds back information to the transmitter. At the beginning of the $(t+1)$-st epoch the transmitter uses the feedback information to produce the next channel input symbol $A_{t+1}$. Finally at time $T$, after observing $B_T$, the decoder outputs the reconstructed message $\hat{W}$.

Definition 6.1 A Markov channel consists of an initial state distribution $p(ds_1)$, the state transition stochastic kernels $\{p(ds_{t+1} \mid s_t, a_t, b_t)\}_{t=1}^{T-1}$, and the channel output stochastic kernels $\{p(db_t \mid s_t, a_t)\}_{t=1}^{T}$. If the stochastic kernel $p(ds_{t+1} \mid s_t, a_t, b_t)$ is independent of $a_t, b_t$ for each $t = 1, \ldots, T-1$ then we say the channel is a Markov channel without ISI (intersymbol interference).

Note that we are assuming the kernels $\{p(ds_{t+1} \mid s_t, a_t, b_t)\}$ and $\{p(db_t \mid s_t, a_t)\}$ are stationary (independent of time).

As before a channel code-function is a sequence of $T$ deterministic measurable maps $\{f_t\}_{t=1}^{T}$ such that $f_t : \mathcal{B}^{t-1} \to \mathcal{A}$ which takes $b^{t-1} \mapsto a_t$. We do not assume, for now, that the state of the channel is observable to the encoder. This will have the effect of restricting ourselves to channel input distributions of the form $\{p(da_t \mid a^{t-1}, b^{t-1})\}$ as opposed to $\{p(da_t \mid s^t, a^{t-1}, b^{t-1})\}$. We do assume that both the encoder and the decoder know $p(ds_1)$. In section 8.1 we show how to introduce state feedback.

6.1 The Sufficient Statistic $\{\Pi_t\}$

Given a sequence of code-function distributions $\{p(df_t \mid f^{t-1})\}_{t=1}^{T}$ we can interconnect the Markov channel to the source. Via a straightforward generalization of definition 3.3 and lemma 3.1 one can show there exists a unique consistent measure:

$$Q(df^T, ds^T, da^T, db^T) = \bigotimes_{t=1}^{T} p(df_t \mid f^{t-1}) \otimes p(ds_t \mid s_{t-1}, a_{t-1}, b_{t-1}) \otimes \delta_{\{f_t(b^{t-1})\}}(da_t) \otimes p(db_t \mid s_t, a_t).$$

Unlike in lemma 3.1 determining the channel without feedback from $\mathcal{F}^T$ to $\mathcal{B}^T$ takes a bit more work. To that end we introduce the sufficient statistics $\{\Pi_t\}$.

Let $\pi(ds) \in \mathcal{P}(\mathcal{S})$ be an element in the space of probability measures on $\mathcal{S}$. Define a stochastic kernel from $\mathcal{P}(\mathcal{S}) \times \mathcal{A}$ to $\mathcal{S} \times \mathcal{B}$:

$$r(ds, db \mid \pi, a) = p(db \mid s, a) \otimes \pi(ds) \tag{10}$$

The following lemma follows from theorem A.3 in the appendix.

Lemma 6.1 There exists a stochastic kernel $r(ds \mid \pi, a, b)$ from $\mathcal{P}(\mathcal{S}) \times \mathcal{A} \times \mathcal{B}$ to $\mathcal{S}$ such that

$$r(ds, db \mid \pi, a) = r(ds \mid \pi, a, b) \otimes r(db \mid \pi, a)$$

where $r(db \mid \pi, a)$ is the marginal of $r(ds, db \mid \pi, a)$. Specifically, for each $b$:

$$r(b \mid \pi, a) = \sum_{s} p(b \mid s, a)\,\pi(s) \tag{11}$$

The statistic $\pi(ds)$ is often called the a priori distribution of the state and $r(ds \mid \pi, a, b)$ the a posteriori distribution of the state after observing $a, b$. We recursively define the sufficient statistics $\{\Pi_t\}_{t=1}^{T}$. Specifically $\Pi_t : \mathcal{A}^{t-1} \times \mathcal{B}^{t-1} \to \mathcal{P}(\mathcal{S})$ is defined as follows:

$$\pi_1(ds_1) = p(ds_1) \tag{12}$$

(where $p(ds_1)$ is given in definition 6.1) and for each $a^t, b^t$ and all $s_{t+1}$:

$$\pi_{t+1}[a^t, b^t](s_{t+1}) = \sum_{s_t} p(s_{t+1} \mid s_t, a_t, b_t)\; r\big(s_t \mid \pi_t[a^{t-1}, b^{t-1}](ds_t), a_t, b_t\big) \tag{13}$$

Equations (12) and (13) are the so-called filtering equations for the state of the channel based on the channel inputs and outputs. Note that equation (13) implies there exists a deterministic, stationary, measurable function $\Phi_\Pi$ such that $\pi_{t+1} = \Phi_\Pi(\pi_t, a_t, b_t)$ for all $t = 1, \ldots, T-1$. Note the statistic $\Pi_t$ depends on information from both the transmitter and the receiver. It can be viewed as the complete system's estimate of the state.
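The filtering recursion in equations (11) to (13) can be sketched in a few lines for finite alphabets. In the following, states, inputs, and outputs are labelled by integers, `p_out[s][a][b]` stands for $p(b \mid s, a)$ and `p_trans[s][a][b][s2]` for $p(s_{t+1} = s2 \mid s, a, b)$; these array layouts, the function names, and the two-state example channel are assumptions made for illustration.

```python
def output_prob(pi, a, b, p_out):
    """r(b | pi, a) = sum_s p(b | s, a) pi(s), as in equation (11)."""
    return sum(p_out[s][a][b] * pi[s] for s in range(len(pi)))

def filter_update(pi, a, b, p_out, p_trans):
    """One step of equation (13): map (pi_t, a_t, b_t) to pi_{t+1}."""
    n = len(pi)
    r_b = output_prob(pi, a, b, p_out)
    if r_b == 0.0:
        raise ValueError("output b has zero probability under (pi, a)")
    # A posteriori distribution r(s | pi, a, b) via Bayes' rule (lemma 6.1).
    posterior = [p_out[s][a][b] * pi[s] / r_b for s in range(n)]
    # Propagate through the state transition kernel p(s' | s, a, b).
    return [sum(p_trans[s][a][b][s2] * posterior[s] for s in range(n))
            for s2 in range(n)]

# Hypothetical two-state channel without ISI:
# state 0 behaves like a BSC(0.1), state 1 like a BSC(0.4).
p_out = [
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.6, 0.4], [0.4, 0.6]],
]
state_matrix = [[0.8, 0.2], [0.3, 0.7]]
p_trans = [[[state_matrix[s] for _b in range(2)] for _a in range(2)]
           for s in range(2)]

pi_1 = [0.5, 0.5]
pi_2 = filter_update(pi_1, a=0, b=0, p_out=p_out, p_trans=p_trans)
```

For this example the posterior after observing $(a_1, b_1) = (0, 0)$ is $(0.6, 0.4)$, and the transition step maps it to $\pi_2 = (0.6, 0.4)$ up to floating-point error. The function `filter_update` plays the role of $\Phi_\Pi$.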

We will now show that the $\{\Pi_t\}$ defined in equations (12) and (13) are consistent with the conditional probabilities $\{Q(ds_t \mid f^t, a^t, b^{t-1})\}$.

Lemma 6.2 We are given a sequence of code-function stochastic kernels $\{p(df_t \mid f^{t-1})\}$, a Markov channel $p(ds_1), \{p(ds_{t+1} \mid s_t, a_t, b_t)\}, \{p(db_t \mid s_t, a_t)\}$, and a consistent joint measure $Q(df^T, ds^T, da^T, db^T)$. Then for each $t$ and all $s_t$ we have

$$Q(s_t \mid f^t, a^t, b^{t-1}) = \pi_t[a^{t-1}, b^{t-1}](s_t) \tag{14}$$

for $Q$ almost all $f^t, a^t, b^{t-1}$.

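Proof: We will prove equation (14) by induction. For $t = 1$ and all $s_1$ we have $\pi_1(s_1)\,Q(f_1, a_1) = p(s_1)\,p(f_1)\,\delta_{\{f_1\}}(a_1) = Q(f_1, s_1, a_1)$. Now for $t + 1$ and all $s_{t+1}$ we have, writing $\pi_t$ for $\pi_t[a^{t-1}, b^{t-1}]$,

$$\begin{aligned}
&\pi_{t+1}[a^t, b^t](s_{t+1})\; Q(f^{t+1}, a^{t+1}, b^t) \\
&\quad= \pi_{t+1}[a^t, b^t](s_{t+1}) \sum_{s_t} Q(f^{t+1}, s_t, a^{t+1}, b^t) \\
&\quad\stackrel{(a)}{=} \Big(\sum_{s_t} p(s_{t+1} \mid s_t, a_t, b_t)\, r(s_t \mid \pi_t(ds_t), a_t, b_t)\Big) \\
&\qquad\times \Big(\sum_{\tilde s_t} \delta_{\{f_{t+1}(b^t)\}}(a_{t+1})\, p(f_{t+1} \mid f^t)\, p(b_t \mid \tilde s_t, a_t)\, \delta_{\{f_t(b^{t-1})\}}(a_t)\, \pi_t(\tilde s_t)\, Q(f^t, a^{t-1}, b^{t-1})\Big) \\
&\quad= \sum_{s_t} p(s_{t+1} \mid s_t, a_t, b_t) \Big(r(s_t \mid \pi_t(ds_t), a_t, b_t) \sum_{\tilde s_t} p(b_t \mid \tilde s_t, a_t)\, \pi_t(\tilde s_t)\Big) \\
&\qquad\times \delta_{\{f_{t+1}(b^t)\}}(a_{t+1})\, p(f_{t+1} \mid f^t)\, \delta_{\{f_t(b^{t-1})\}}(a_t)\, Q(f^t, a^{t-1}, b^{t-1}) \\
&\quad\stackrel{(b)}{=} \sum_{s_t} p(s_{t+1} \mid s_t, a_t, b_t)\, \big(p(b_t \mid s_t, a_t)\, \pi_t(s_t)\big) \\
&\qquad\times \delta_{\{f_{t+1}(b^t)\}}(a_{t+1})\, p(f_{t+1} \mid f^t)\, \delta_{\{f_t(b^{t-1})\}}(a_t)\, Q(f^t, a^{t-1}, b^{t-1}) \\
&\quad\stackrel{(c)}{=} \sum_{s_t} \delta_{\{f_{t+1}(b^t)\}}(a_{t+1})\, p(s_{t+1} \mid s_t, a_t, b_t)\, p(f_{t+1} \mid f^t)\, p(b_t \mid s_t, a_t)\, \delta_{\{f_t(b^{t-1})\}}(a_t)\, Q(f^t, s_t, a^{t-1}, b^{t-1}) \\
&\quad= \sum_{s_t} Q(f^{t+1}, s_t, s_{t+1}, a^{t+1}, b^t) \\
&\quad= Q(f^{t+1}, s_{t+1}, a^{t+1}, b^t)
\end{aligned}$$

where (a) follows from the definition of $\Pi_t$ and the induction hypothesis. Line (b) follows from lemma 6.1 and (c) is another application of the induction hypothesis. □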

Note that equation (14) states that the conditional probability $Q(ds_t \mid f^t, a^t, b^{t-1})$ does not depend on $f^t$ almost surely. In addition the filtering equations (12) and (13) are
defined independently of the code-function distributions (or equivalently the channel input 
distributions). This is related to Witsenhausen's work on policy independence [3^]. Finally 
observe that equation (14) and the fact that $\Pi_t$ is a function of $A^{t-1}, B^{t-1}$ imply that $S_t - \Pi_t - (F^t, A^t, B^{t-1})$ forms a Markov chain under any consistent measure $Q$.

6.2 Markov Channel Coding Theorem 

We are now in a position to describe the "$\mathcal{F}^T$ - $\mathcal{B}^T$" channel in terms of the underlying Markov channel. We then prove the Markov channel coding theorem.

Lemma 6.3 We are given a sequence of code-function stochastic kernels $\{p(df_t \mid f^{t-1})\}$; a Markov channel $p(ds_1), \{p(ds_{t+1} \mid s_t, a_t, b_t)\}, \{p(db_t \mid s_t, a_t)\}$; and a consistent joint measure $Q(df^T, ds^T, da^T, db^T)$. Then for each $t$ and all $b_t$ we have

$$Q(b_t \mid f^t, a^t, b^{t-1}) = r\big(b_t \mid \pi_t[a^{t-1}, b^{t-1}](ds_t), a_t\big) \tag{15}$$

for $Q$ almost all $f^t, a^t, b^{t-1}$, where $r(db \mid \pi, a)$ was defined in equation (11).

Proof: For each $t$ note that

$$\begin{aligned}
Q(f^t, a^t, b^t) &= \sum_{s_t} Q(f^t, s_t, a^t, b^t) \\
&= \sum_{s_t} p(b_t \mid s_t, a_t)\, Q(f^t, s_t, a^t, b^{t-1}) \\
&\stackrel{(a)}{=} \sum_{s_t} p(b_t \mid s_t, a_t)\, \pi_t[a^{t-1}, b^{t-1}](s_t)\, Q(f^t, a^t, b^{t-1}) \\
&= r\big(b_t \mid \pi_t[a^{t-1}, b^{t-1}](ds_t), a_t\big)\, Q(f^t, a^t, b^{t-1})
\end{aligned}$$

where (a) follows from lemma 6.2. □

The previous lemma shows that $B_t - (\Pi_t, A_t) - (F^t, A^{t-1}, B^{t-1})$ forms a Markov chain under $Q$.

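Corollary 6.1 We are given a sequence of code-function stochastic kernels $\{p(df_t \mid f^{t-1})\}$; a Markov channel $p(ds_1), \{p(ds_{t+1} \mid s_t, a_t, b_t)\}, \{p(db_t \mid s_t, a_t)\}$; and a consistent joint measure $Q(df^T, ds^T, da^T, db^T)$. Then for each $t$ and all $b_t$ we have

$$Q(b_t \mid f^t, b^{t-1}) = r\big(b_t \mid \pi_t[f^{t-1}(b^{t-2}), b^{t-1}](ds_t), f_t(b^{t-1})\big) \tag{16}$$

for $Q$ almost all $f^t, b^{t-1}$.

Proof: For each $t$ note that

$$\begin{aligned}
Q(f^t, b^t) &= \sum_{a^t} Q(f^t, a^t, b^t) \\
&= \sum_{a^t} r\big(b_t \mid \pi_t[a^{t-1}, b^{t-1}](ds_t), a_t\big)\, Q(f^t, a^t, b^{t-1}) \\
&= r\big(b_t \mid \pi_t[f^{t-1}(b^{t-2}), b^{t-1}](ds_t), f_t(b^{t-1})\big)\, Q(f^t, b^{t-1})
\end{aligned}$$

where the second line follows from lemma 6.3. □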

The corollary shows that we can convert a Markov channel into a channel of the general 
form considered in sections 3-5. Hence we can define the operational channel capacity, C°, 
for the Markov channel with feedback in exactly the same way we did in definition 4.3. 
We can also use the same definitions of capacity, C, as before. Thus we can directly apply 
theorem 5.1 and its generalization, theorem 5.4, to prove: 

Theorem 6.1 For Markov channels we have $C^{O} = C$. For Markov channels with information pattern $\Psi$ we have $C^{O,\Psi} = C^{\Psi}$.

We end this section by noting that the use of $\{\Pi_t\}$ can simplify the form of the directed information and the choice of the channel input distribution.

Lemma 6.4 For a Markov channel $I(F^T; B^T) = I(A^T \to B^T) = \sum_{t=1}^{T} I(A_t, \Pi_t; B_t \mid B^{t-1})$.

Proof: The first equality follows from lemma 5.2. The second equality follows from noting that $I(A^T \to B^T) = \sum_{t=1}^{T} I(A^t; B_t \mid B^{t-1})$. For $t = 1$ we know $\Pi_1(ds_1) = p(ds_1)$ is a fixed, non-random, measure known to both the transmitter and receiver. Hence $I(A_1; B_1) = I(A_1, \Pi_1; B_1)$. For $t > 1$ we have

$$\begin{aligned}
I(A^t; B_t \mid B^{t-1}) &= I(A_t, \Pi_t; B_t \mid B^{t-1}) + I(A^{t-1}; B_t \mid \Pi_t, A_t, B^{t-1}) - I(\Pi_t; B_t \mid A^t, B^{t-1}) \\
&= I(A_t, \Pi_t; B_t \mid B^{t-1})
\end{aligned}$$

where $I(\Pi_t; B_t \mid A^t, B^{t-1}) = 0$ because $\Pi_t$ is a function of $(A^{t-1}, B^{t-1})$. Lemma 6.3 implies $(A^{t-1}, B^{t-1}) - (\Pi_t, A_t) - B_t$ is a Markov chain, hence $I(A^{t-1}; B_t \mid \Pi_t, A_t, B^{t-1}) = 0$. □

Note that we can view the pair $(A_t, \Pi_t)$ as the input to the channel. This makes sense because the decoder needs information about the encoder's estimate of the state given by $\Pi_t$. The next lemma shows that we can simplify the form of the channel input distribution.

Lemma 6.5 Given a Markov channel $p(ds_1), \{p(ds_{t+1} \mid s_t, a_t, b_t)\}, \{p(db_t \mid s_t, a_t)\}$, and a channel input distribution $\{q(da_t \mid a^{t-1}, b^{t-1})\}$ with resulting joint measure $Q(ds^T, da^T, db^T)$ there exists another channel input distribution of the form $\{r(da_t \mid \pi_t, b^{t-1})\}$ with resulting joint measure $R(ds^T, da^T, db^T)$ such that for each $t$ we have (see footnote 1)

$$R(d\pi_t, da_t, db^t) = Q(d\pi_t, da_t, db^t)$$

and hence $I_R(A_t, \Pi_t; B_t \mid B^{t-1}) = I_Q(A_t, \Pi_t; B_t \mid B^{t-1})$.

Proof: From lemmas 6.2 and 6.3 and equation (13) we know

$$Q(d\pi^T, da^T, db^T) = \bigotimes_{t=1}^{T} r(db_t \mid \pi_t, a_t) \otimes q(da_t \mid a^{t-1}, b^{t-1}) \otimes \delta_{\{\Phi_\Pi(\pi_{t-1}, a_{t-1}, b_{t-1})\}}(d\pi_t) \tag{17}$$

where, as an abuse of notation, we let $\delta_{\{\Phi_\Pi(\pi_0, a_0, b_0)\}}(d\pi_1) = \delta_{\{p(ds_1)\}}(d\pi_1)$. For each $t$ define the stochastic kernel $r(da_t \mid \pi_t, b^{t-1})$ to be a version of the conditional distribution $Q(da_t \mid \pi_t, b^{t-1})$ (see theorem A.3 in the appendix).

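We proceed by induction. For $t = 1$ we know $\pi_1(ds_1) = p(ds_1)$. For any Borel measurable set $\Omega \subset \mathcal{P}(\mathcal{S})$, $a_1$, $b_1$ we have

$$R(\Omega, a_1, b_1) = \int_{\Omega} r(b_1 \mid \pi_1, a_1)\, r(a_1 \mid \pi_1)\, \delta_{\{p(ds_1)\}}(d\pi_1) = \int_{\Omega} r(b_1 \mid \pi_1, a_1)\, Q(d\pi_1, a_1) = Q(\Omega, a_1, b_1).$$

Now for $t + 1$ and any Borel measurable set $\Omega \subset \mathcal{P}(\mathcal{S})$, $a_{t+1}$, $b^{t+1}$ we have

$$\begin{aligned}
R(\Omega, a_{t+1}, b^{t+1})
&= \int_{\Omega}\int_{\mathcal{P}(\mathcal{S})} \sum_{a_t} r(b_{t+1} \mid \pi_{t+1}, a_{t+1})\, r(a_{t+1} \mid \pi_{t+1}, b^t)\, \delta_{\{\Phi_\Pi(\pi_t, a_t, b_t)\}}(d\pi_{t+1})\, R(d\pi_t, a_t, b^t) \\
&\stackrel{(a)}{=} \int_{\Omega}\int_{\mathcal{P}(\mathcal{S})} \sum_{a_t} r(b_{t+1} \mid \pi_{t+1}, a_{t+1})\, r(a_{t+1} \mid \pi_{t+1}, b^t)\, \delta_{\{\Phi_\Pi(\pi_t, a_t, b_t)\}}(d\pi_{t+1})\, Q(d\pi_t, a_t, b^t) \\
&= \int_{\Omega}\int_{\mathcal{P}(\mathcal{S})} \sum_{a_t} r(b_{t+1} \mid \pi_{t+1}, a_{t+1})\, r(a_{t+1} \mid \pi_{t+1}, b^t)\, Q(d\pi_t, d\pi_{t+1}, a_t, b^t) \\
&= \int_{\Omega} r(b_{t+1} \mid \pi_{t+1}, a_{t+1})\, r(a_{t+1} \mid \pi_{t+1}, b^t)\, Q(d\pi_{t+1}, b^t) \\
&\stackrel{(b)}{=} \int_{\Omega} r(b_{t+1} \mid \pi_{t+1}, a_{t+1})\, Q(d\pi_{t+1}, a_{t+1}, b^t) \\
&= Q(\Omega, a_{t+1}, b^{t+1})
\end{aligned}$$

where (a) follows from the induction hypothesis and (b) follows by the construction of $r(da_{t+1} \mid \pi_{t+1}, b^t)$. □

Footnote 1: For any Borel measurable $\Omega \subset \mathcal{P}(\mathcal{S})$ let $Q(\Pi_t \in \Omega, A_t = a_t, B^t = b^t) = Q\big(\{(\tilde a^t, \tilde b^t) : \tilde a_t = a_t,\ \tilde b^t = b^t,\ \pi_t[\tilde a^{t-1}, \tilde b^{t-1}] \in \Omega\}\big)$.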

The lemma states that we can without loss of generality restrict ourselves to channel input distributions of the form $\{q(da_t \mid \pi_t, b^{t-1})\}$. Note that the dependence on $a^{t-1}$ appears only through $\pi_t[a^{t-1}, b^{t-1}](ds_t)$. If $\pi_t[a^{t-1}, b^{t-1}]$ is not a function of $a^{t-1}$ then the distribution of $a_t$ will depend only on the feedback $b^{t-1}$. We discuss when this happens in section 8.

In summary, we have shown that any Markov channel, $p(ds_1), \{p(ds_{t+1} \mid s_t, a_t, b_t)\}, \{p(db_t \mid s_t, a_t)\}$, can be converted into another Markov channel with initial state $\Pi_1(d\pi_1) = \delta_{\{p(ds_1)\}}(d\pi_1)$, deterministic state transitions $\Pi_{t+1} = \Phi_\Pi(\Pi_t, A_t, B_t)$, and channel output stochastic kernels $\{r(db_t \mid \pi_t, a_t)\}$. We call this the canonical Markov channel associated with the original Markov channel. Thus the problem of determining the capacity of a Markov channel with state space $\mathcal{S}$ has been reduced to determining the capacity of the canonical Markov channel. The latter Markov channel has state space $\mathcal{P}(\mathcal{S})$ and a state that is computable from the channel inputs and outputs.

Note that even if the original Markov channel does not have ISI it is typically the case 
that the canonical Markov channel will have ISI. This is because the choice of channel 
input can help the decoder identify the channel. This property is called dual control in the 
stochastic control literature [H].

7 The MDP Formulation



Our goal in this section is to formulate the following optimization problem for Markov channels with feedback as an infinite horizon average cost problem:

$$\sup_{\mathcal{D}_\infty} \liminf_{T \to \infty} \frac{1}{T} I(A^T \to B^T) = \sup_{\mathcal{D}_\infty} \liminf_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} I(A_t, \Pi_t; B_t \mid B^{t-1}). \tag{18}$$

We first give a high-level discussion of the issues, then we formulate the optimization problem in equation (18) as a partially observed Markov decision problem (POMDP), convert the POMDP to a fully observed MDP, and provide the average cost optimality equation (ACOE).

Before proceeding the reader may notice that the optimization in equation (18) is different from the one given after definition 4.4: $C = \sup_{\{\mathcal{D}_T\}_{T=1}^{\infty}} \underline{I}(A \to B)$. In the course of this section it will be shown that the two optimizations are equivalent. That one can without loss of generality restrict the optimization to $\mathcal{D}_\infty$ instead of $\{\mathcal{D}_T\}_{T=1}^{\infty}$ is a consequence of Bellman's principle of optimality. In addition conditions will be given such that under the optimal channel input distribution we have $\liminf_{T \to \infty} \frac{1}{T} I(A^T \to B^T) = \underline{I}(A \to B)$.

To compute $I(A_t, \Pi_t; B_t \mid B^{t-1})$ we need to know the measure:

$$Q(d\pi_t, da_t, db^t) = r(db_t \mid \pi_t, a_t) \otimes q(da_t \mid \pi_t, b^{t-1}) \otimes Q(d\pi_t, db^{t-1}). \tag{19}$$

By lemma 6.5 we know that we can without loss of generality restrict ourselves to channel input distributions of the form $\{q(da_t \mid \pi_t, b^{t-1})\}$.

To formulate the optimization in (18) as a stochastic control problem we need to specify the state space, the control actions, and the running cost. At first glance it may appear that the encoder should choose control actions of the form $u_t(da_t)$ based on the information $(\pi_t[a^{t-1}, b^{t-1}], b^{t-1})$. Unfortunately one cannot write the running cost in terms of $u_t(da_t)$. To see this observe that the argument of the logarithm under the expectation in

$$I(A_t, \Pi_t; B_t \mid B^{t-1}) = E\left[\log\frac{Q(B_t \mid A_t, \Pi_t, B^{t-1})}{Q(B_t \mid B^{t-1})}\right]$$

can be written as

$$\frac{r(b_t \mid \pi_t, a_t)}{Q(b_t \mid b^{t-1})} = \frac{r(b_t \mid \pi_t, a_t)}{\int\!\!\int r(b_t \mid \tilde\pi_t, \tilde a_t)\, Q(d\tilde\pi_t, d\tilde a_t \mid b^{t-1})} \quad Q\text{-almost all } \pi_t, a_t, b^t \tag{20}$$

This depends on $Q(d\pi_t, da_t \mid b^{t-1})$ and not $Q(da_t \mid \pi_t, b^{t-1})$.

This suggests that the control actions should be stochastic kernels of the form $u_t(da_t \mid \pi_t)$. This too is problematic. Note that we are interested in an optimization given in equation (18) and hence would like for there to be a topology on the space of stochastic kernels of the form $u_t(da_t \mid \pi_t)$. In some cases there is a natural parameterization of this space. For
example, for Gaussian channels it is known that the optimal input distribution is linear 
and can be parameterized by its coefficients [7], [22]) 00]) US- But in general there is 
no, at least to the author's knowledge, natural topology on the space of stochastic kernels. 
Hence we will choose control actions of the form $u_t(d\pi_t, da_t)$. The next section formalizes the stochastic control problem with this choice of control action.

7.1 Partially Observed Markov Decision Problem

Here we first describe the components of the POMDP formulation. In the next section we show the equivalence of the POMDP formulation to the optimization (18).

Consider the control action $u(d\pi, da)$ in the control space $\mathcal{U} = \mathcal{P}(\mathcal{P}(\mathcal{S}) \times \mathcal{A})$. The space $\mathcal{U}$ is a Polish space (i.e. a complete, separable metric space) equipped with the topology of weak convergence.

The state at time $t > 1$ is $X_t = (\Pi_{t-1}, A_{t-1}, B_{t-1}) \in \mathcal{P}(\mathcal{S}) \times \mathcal{A} \times \mathcal{B}$ and $X_1 = \emptyset$. The dynamics are given as:

$$r(dx_{t+1} \mid x_t, u_t) = r(db_t \mid \pi_t, a_t) \otimes u_t(d\pi_t, da_t) \tag{21}$$

Note that the dynamics depend only on $u_t$. The observation at time $t > 1$ is given by $Y_t = B_{t-1}$ and $Y_1 = \emptyset$. Note that $Y_t$ is a deterministic function of $X_t$.

As discussed, one of the main difficulties in formulating (18) as a POMDP has to do with the form of the cost (20). The cost at time $t$ is given as

$$c(x_t, u_t, x_{t+1}) = \begin{cases} \log\dfrac{r(b_t \mid \pi_t, a_t)}{\int r(b_t \mid \tilde\pi_t, \tilde a_t)\, u_t(d\tilde\pi_t, d\tilde a_t)} & \text{if } \int r(b_t \mid \tilde\pi_t, \tilde a_t)\, u_t(d\tilde\pi_t, d\tilde a_t) > 0 \\[2ex] 0 & \text{else} \end{cases} \tag{22}$$

Note that the cost is just a function of $u_t, x_{t+1}$.

The information pattern at the controller at time $t$ is $(Y^t, U^{t-1}) = (B^{t-1}, U^{t-1}) \in \mathcal{B}^{t-1} \times \mathcal{U}^{t-1}$. The policy at time $t$ is a stochastic kernel $\mu_t(du_t \mid b^{t-1}, u^{t-1})$ from $\mathcal{B}^{t-1} \times \mathcal{U}^{t-1}$ to $\mathcal{U}$. A policy $\{\mu_t\}$ is said to be a deterministic policy if for each $t$ and all $(b^{t-1}, u^{t-1})$ the stochastic kernel $\mu_t(du_t \mid b^{t-1}, u^{t-1})$ assigns mass one to only one point in $\mathcal{U}$. In this case we will abuse notation and write $u_t = \mu_t(b^{t-1}, u^{t-1})$. Technically, we should explicitly include $p(ds_1)$ and the other channel parameters in the information pattern. But because the channel parameters are fixed throughout and to reduce notation we shall not explicitly mention the control policy's dependence on them.

The time-order of events is the usual one for POMDPs: $X_1, Y_1, U_1, X_2, Y_2, U_2, \ldots$ For a given policy $\{\mu_t\}$ the resulting joint measure is

$$R^{\mu}(du^T, d\pi^T, da^T, db^T) = \bigotimes_{t=1}^{T} r(db_t \mid \pi_t, a_t) \otimes u_t(d\pi_t, da_t) \otimes \mu_t(du_t \mid u^{t-1}, b^{t-1}) \tag{23}$$

where we have used equation (21). Note that this $R$ measure is not the same as the one used in equation (17) of lemma 6.5. Compare the differences between the $R$ measure given in (23) and the $Q$ measure given in equation (17). The next two sections discuss the relation between these two different measures.

7.2 The Sufficient Statistic $\{\Gamma_t\}$ and the Control Constraints

The dynamics given in equation (21), the control policy $\{\mu_t\}$, and the running cost given in equation (22) are not enough to specify the optimization in equation (18). In particular, in the original optimization $\{\Pi_t\}$ is determined by (13), whereas in the POMDP optimization the $\{\Pi_t\}$ are determined by the policy $\{\mu_t\}$. We need to ensure that the $\{\Pi_t\}$ play similar roles in both cases. To this end we will next define appropriate control constraints.

Equation (21) states $r(d\pi, da, db \mid u) = r(db \mid \pi, a) \otimes u(d\pi, da)$. The following lemma follows from theorem A.3 in the appendix.

Lemma 7.1 There exists a stochastic kernel $r(d\pi, da \mid u, b)$ from $\mathcal{U} \times \mathcal{B}$ to $\mathcal{P}(\mathcal{S}) \times \mathcal{A}$ such that $r(d\pi, da, db \mid u) = r(d\pi, da \mid u, b) \otimes r(db \mid u)$ where $r(db \mid u)$ is the marginal of $r(d\pi, da, db \mid u)$.

We now define the statistics $\{\Gamma_t\} \in \mathcal{P}(\mathcal{P}(\mathcal{S}))$. This is the space of probability measures on probability measures on $\mathcal{S}$. Specifically $\Gamma_t : \mathcal{U}^{t-1} \times \mathcal{B}^{t-1} \to \mathcal{P}(\mathcal{P}(\mathcal{S}))$ is defined as follows. For $t = 1$ let

$$\gamma_1(d\pi_1) = \delta_{\{p(ds_1)\}}(d\pi_1) \tag{24}$$

and for $t > 1$ and each $u^{t-1}, b^{t-1}$ and all Borel measurable $\Omega \subset \mathcal{P}(\mathcal{S})$:

$$\gamma_t[u^{t-1}, b^{t-1}](\Omega) = \int\!\!\int \mathbf{1}\{\Phi_\Pi(\pi_{t-1}, a_{t-1}, b_{t-1}) \in \Omega\}\; r(d\pi_{t-1}, da_{t-1} \mid u_{t-1}, b_{t-1}). \tag{25}$$

Here $\mathbf{1}\{\cdot\}$ corresponds to the indicator function. Note that for $t > 1$, $\gamma_t[u^{t-1}, b^{t-1}](d\pi_t)$ depends only on $u_{t-1}, b_{t-1}$. Sometimes, for $t > 1$, we will just write $\gamma_t[u_{t-1}, b_{t-1}](d\pi_t)$.

Equation (25) implies there exists a deterministic, stationary, measurable function $\Phi_\Gamma$ such that $\gamma_{t+1} = \Phi_\Gamma(u_t, b_t)$ for all $t = 1, \ldots, T-1$. Note that because of feedback the statistic $\Gamma_t$ can be computed at both the transmitter and the receiver. It can be viewed as the receiver's estimate of the transmitter's estimate of the state of the channel.
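The update $\gamma_{t+1} = \Phi_\Gamma(u_t, b_t)$ of equation (25) can be sketched for a control action with finite support. The representation of $u_t$ as a dictionary `{(pi, a): mass}` with `pi` a tuple of state probabilities, the array layouts `p_out[s][a][b]` and `p_trans[s][a][b][s2]`, the function names, and the example channel are all assumptions made for illustration.

```python
def phi_pi(pi, a, b, p_out, p_trans):
    """Filter update of equation (13), returned as a hashable tuple."""
    n = len(pi)
    joint = [p_out[s][a][b] * pi[s] for s in range(n)]  # p(b | s, a) pi(s)
    r_b = sum(joint)                                    # r(b | pi, a)
    return tuple(sum(p_trans[s][a][b][s2] * joint[s] for s in range(n)) / r_b
                 for s2 in range(n))

def phi_gamma(u, b, p_out, p_trans):
    """gamma_{t+1} = Phi_Gamma(u_t, b_t) for finitely supported u_t."""
    weights = {}
    for (pi, a), mass in u.items():
        r_b = sum(p_out[s][a][b] * pi[s] for s in range(len(pi)))
        if mass * r_b > 0.0:
            weights[(pi, a)] = mass * r_b   # unnormalised r(pi, a | u, b)
    total = sum(weights.values())           # r(b | u)
    if total == 0.0:
        raise ValueError("output b has zero probability under u")
    gamma = {}
    for (pi, a), w in weights.items():
        nxt = phi_pi(pi, a, b, p_out, p_trans)  # push forward through Phi_Pi
        gamma[nxt] = gamma.get(nxt, 0.0) + w / total
    return gamma

# Hypothetical two-state channel without ISI (BSC(0.1) and BSC(0.4)).
p_out = [
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.6, 0.4], [0.4, 0.6]],
]
state_matrix = [[0.8, 0.2], [0.3, 0.7]]
p_trans = [[[state_matrix[s] for _b in range(2)] for _a in range(2)]
           for s in range(2)]

u = {((0.5, 0.5), 0): 0.5, ((0.5, 0.5), 1): 0.5}
gamma = phi_gamma(u, 0, p_out, p_trans)
```

Note that `phi_gamma` needs only $u_t$ and $b_t$, never the realised $(\pi_t, a_t)$, which is why both the transmitter and the receiver can compute it.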

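We now define the control constraints. Let
$$\mathcal{U}(\gamma) = \{u(d\pi, da) : u(d\pi, da) \in \mathcal{U},\; u(d\pi) = \gamma(d\pi)\}. \tag{26}$$

Note that for each $\gamma \in \mathcal{P}(\mathcal{P}(\mathcal{S}))$ the set $\mathcal{U}(\gamma)$ is compact. For each $t$ and $(u^{t-1}, b^{t-1})$ the control constraint $\mathcal{U}_t(\cdot) \subset \mathcal{U}$ is defined as:
$$\mathcal{U}_t(u^{t-1}, b^{t-1}) = \mathcal{U}(\gamma_t[u^{t-1}, b^{t-1}]). \tag{27}$$

For each $t$ the policy $\mu_t$ will enforce the control constraint. Specifically for all $(u^{t-1}, b^{t-1})$
$$\mu_t\big(\{u_t \in \mathcal{U}(\gamma_t[u^{t-1}, b^{t-1}])\} \mid u^{t-1}, b^{t-1}\big) = 1. \tag{28}$$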
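The constraint in (26) can be made concrete when all measures have finite support. The following is a minimal sketch under that assumption: a control `u` is represented as a dictionary mapping `(pi, a)` pairs to probabilities, a statistic `gamma` as a dictionary mapping beliefs `pi` (tuples) to probabilities, and the helper names `pi_marginal` and `in_constraint_set` are hypothetical, not part of the original development.

```python
from collections import defaultdict

def pi_marginal(u):
    """Marginal of u(dpi, da) on pi; u maps (pi, a) -> probability."""
    marginal = defaultdict(float)
    for (pi, _a), weight in u.items():
        marginal[pi] += weight
    return dict(marginal)

def in_constraint_set(u, gamma, tol=1e-12):
    """True if the pi-marginal of u equals gamma, i.e. u lies in U(gamma)."""
    marginal = pi_marginal(u)
    support = set(marginal) | set(gamma)
    return all(abs(marginal.get(pi, 0.0) - gamma.get(pi, 0.0)) <= tol
               for pi in support)

# Illustrative data: gamma is a point mass on one belief over two states.
belief = (0.5, 0.5)
gamma = {belief: 1.0}
u_ok = {(belief, 0): 0.3, (belief, 1): 0.7}            # marginal matches gamma
u_bad = {(belief, 0): 0.3, ((1.0, 0.0), 1): 0.7}       # puts mass on another belief
```

Here `u_ok` is an admissible control for `gamma`, while `u_bad` is not, because its marginal places mass on a belief that `gamma` does not charge.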

The next lemma shows that the $\{\Gamma_t\}$ are consistent with the conditional probabilities $R^\mu(d\pi_t \mid u^{t-1}, b^{t-1})$.

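Lemma 7.2 We are given $p(ds_1)$, the dynamics (21), and a policy $\{\mu_t\}$ satisfying the control constraint (28) with resulting measure $R^\mu(du^T, d\pi^T, da^T, db^T)$. Then for each $t$ we have:
$$R^\mu(d\pi_t \mid u^{t-1}, b^{t-1}) = \gamma_t[u^{t-1}, b^{t-1}](d\pi_t) \tag{29}$$
for $R^\mu$ almost all $u^{t-1}, b^{t-1}$.

Proof: Fix a Borel measurable set $\Omega \subset \mathcal{P}(\mathcal{S})$. For any $t$, any Borel measurable sets $G_k \subset \mathcal{U}$, $k = 1, \ldots, t-1$, and any $b^{t-1}$ we have, writing $G^{t-1} = G_1 \times \cdots \times G_{t-1}$:
$$\begin{aligned}
R^\mu(\Omega, G^{t-1}, b^{t-1})
&= \int_{\mathcal{U}} \int_{G^{t-1}} u_t(\Omega, \mathcal{A})\, \mu_t(du_t \mid u^{t-1}, b^{t-1})\, R^\mu(du^{t-1}, b^{t-1}) \\
&= \int_{G^{t-1}} \Big( \int_{\mathcal{U}} u_t(\Omega, \mathcal{A})\, \mu_t(du_t \mid u^{t-1}, b^{t-1}) \Big) R^\mu(du^{t-1}, b^{t-1}) \\
&= \int_{G^{t-1}} \gamma_t[u^{t-1}, b^{t-1}](\Omega)\, R^\mu(du^{t-1}, b^{t-1})
\end{aligned}$$
where the last equality follows because the control policy satisfies the control constraint given in equation (28). □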

Equations (29) and (25) show the conditional probability R^^dTit \ n*~^,6*~^) does not 
depend on the policy ji and 6*~^ almost surely. See comments after lemma 6.2. 

We can simplify the form of the cost, in the standard way, by computing the expectation 
over the next state. For each t define: 

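$$\begin{aligned}
c(u_t) &= E_{R^\mu}\big[c(X_t, U_t, X_{t+1}) \mid u_t\big] \\
&\overset{(a)}{=} \int r(db_t \mid \pi_t, a_t)\, u_t(d\pi_t, da_t) \log \frac{r(b_t \mid \pi_t, a_t)}{\int r(b_t \mid \tilde{\pi}_t, \tilde{a}_t)\, u_t(d\tilde{\pi}_t, d\tilde{a}_t)}
\end{aligned} \tag{30}$$
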

where (a) follows from equation (22) and the fact that q does not depend on xt- 
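To illustrate the running cost in (30), the following minimal sketch evaluates it for finite alphabets and a control with finite support. It assumes, as a labeled assumption that is not restated in this section, that the induced output kernel is $r(b \mid \pi, a) = \sum_s p(b \mid s, a)\,\pi(s)$; the names `r_output`, `running_cost`, and the table layout `p_out[s][a][b]` are hypothetical. The result is in nats.

```python
import math

def r_output(p_out, pi, a):
    """r(b | pi, a) = sum_s p(b | s, a) pi(s); p_out[s][a][b] is an assumed layout."""
    n_b = len(p_out[0][0])
    return [sum(pi[s] * p_out[s][a][b] for s in range(len(pi)))
            for b in range(n_b)]

def running_cost(u, p_out):
    """c(u) as in (30), in nats, for u given as {(pi, a): weight}."""
    n_b = len(p_out[0][0])
    rows = [(w, r_output(p_out, pi, a)) for (pi, a), w in u.items() if w > 0]
    # Output distribution induced by u: the denominator inside the logarithm.
    mix = [sum(w * r[b] for w, r in rows) for b in range(n_b)]
    total = 0.0
    for w, r in rows:
        for b in range(n_b):
            if r[b] > 0:
                total += w * r[b] * math.log(r[b] / mix[b])
    return total

# Illustrative data: one state, noiseless binary channel, uniform input.
p_noiseless = [[[1.0, 0.0], [0.0, 1.0]]]
u_uniform = {((1.0,), 0): 0.5, ((1.0,), 1): 0.5}
cost_value = running_cost(u_uniform, p_noiseless)
```

For this noiseless example the cost equals $\log 2$ nats, which is also the upper bound $\log |\mathcal{B}|$ on the cost for a binary output alphabet.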
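In summary, we have formulated an average cost, infinite horizon, POMDP:
$$\sup_{\{\mu_t\}} \liminf_{T \to \infty} \frac{1}{T} E_{R^\mu}\Big[\sum_{t=1}^{T} c(X_t, U_t, X_{t+1})\Big] \tag{31}$$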
with dynamics given by (21) and costs given by (22). The supremization is over all policies 
that satisfy the control constraint (28). In the next section we show that the optimization 
in (31) is equivalent to the optimization in (18). 



7.3 Equivalence of the Optimization Problems 

We now show the equivalence of the optimization problem posed in equation (18) and that posed in (31). As discussed at the end of section 7.1 the measures $Q$ and $R^\mu$ can be different. By equivalence we mean that for any choice of channel input distribution $\{q(da_t \mid \pi_t, b^{t-1})\}$ with resulting joint measure $Q(ds^T, da^T, db^T)$ we can find a control policy $\{\mu_t\}$ satisfying (28) with resulting joint measure $R^\mu(du^T, d\pi^T, da^T, db^T)$ such that for each $t$:
$$Q(d\pi_t, da_t, db^t) = R^\mu(d\pi_t, da_t, db^t). \tag{32}$$

Vice versa, given any policy $\{\mu_t\}$ satisfying (28) we can find a channel input distribution $\{q(da_t \mid \pi_t, b^{t-1})\}$ such that the above marginals are equal. This equivalence will imply that the optimal costs for the two problems are the same and the optimal channel input distribution is related to the optimal policy.

Lemma 7.3 For every channel input distribution $\{q(da_t \mid \pi_t, b^{t-1})\}$ with resulting joint measure $Q(ds^T, da^T, db^T)$ there exists a deterministic policy $\{\mu_t\}$ satisfying (28) with resulting joint measure $R^\mu(du^T, d\pi^T, da^T, db^T)$ such that for each $t$: $R^\mu(d\pi_t, da_t, db^t) = Q(d\pi_t, da_t, db^t)$.

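Proof: For each $t$ choose a deterministic policy that satisfies:
$$\mu_t[b^{t-1}](d\pi_t, da_t) = Q(d\pi_t, da_t \mid b^{t-1})$$
for $Q$ almost all $b^{t-1}$. We proceed by induction. For $t = 1$ we have $R^\mu(d\pi_1, da_1, db_1) = r(db_1 \mid \pi_1, a_1) \otimes \mu_1[p(ds_1)](d\pi_1, da_1) = r(db_1 \mid \pi_1, a_1) \otimes Q(d\pi_1, da_1) = Q(d\pi_1, da_1, db_1)$. For $t + 1$ we have for any Borel measurable $\Omega \subset \mathcal{P}(\mathcal{S})$ and all $a_{t+1}, b^{t+1}$:
$$\begin{aligned}
R^\mu(\Omega, a_{t+1}, b^{t+1})
&= \int_{\Omega} r(b_{t+1} \mid \pi_{t+1}, a_{t+1})\, \mu_{t+1}[b^t](d\pi_{t+1}, a_{t+1})\, R^\mu(b^t) \\
&\overset{(a)}{=} \int_{\Omega} r(b_{t+1} \mid \pi_{t+1}, a_{t+1})\, Q(d\pi_{t+1}, a_{t+1} \mid b^t)\, Q(b^t) \\
&= Q(\Omega, a_{t+1}, b^{t+1})
\end{aligned}$$
where (a) follows from the induction hypothesis and our choice of $\mu_{t+1}$.

Now we show the policy $\{\mu_t\}$ satisfies the control constraint (28). For $t = 1$ we have $\mu_1(d\pi_1) = Q(d\pi_1) = \gamma_1(d\pi_1)$. For $t > 1$ we have for any Borel measurable $\Omega \subset \mathcal{P}(\mathcal{S})$ and all $b^{t-1}$:
$$\begin{aligned}
\mu_t[b^{t-1}](\Omega)\, R^\mu(b^{t-1})
&\overset{(a)}{=} Q(\Omega, b^{t-1}) \\
&\overset{(b)}{=} \int\!\!\int \{\Phi_\Pi(\pi_{t-1}, a_{t-1}, b_{t-1}) \in \Omega\}\, r(b_{t-1} \mid \pi_{t-1}, a_{t-1})\, \mu_{t-1}[b^{t-2}](d\pi_{t-1}, da_{t-1})\, Q(b^{t-2}) \\
&\overset{(c)}{=} \int\!\!\int \{\Phi_\Pi(\pi_{t-1}, a_{t-1}, b_{t-1}) \in \Omega\}\, r(d\pi_{t-1}, da_{t-1} \mid \mu_{t-1}[b^{t-2}], b_{t-1})\, r(b_{t-1} \mid \mu_{t-1}[b^{t-2}])\, R^\mu(b^{t-2}) \\
&\overset{(d)}{=} \gamma_t\big[\mu_{t-1}[b^{t-2}], b_{t-1}\big](\Omega)\, R^\mu(b^{t-1})
\end{aligned}$$
where (a) follows from the first part and the choice of control; (b) follows from our choice of control; (c) follows from the first part and lemma 7.1; and (d) follows from equation (25). Finally, altering on a set of measure zero if necessary, we can ensure that for each $t$ the deterministic policy $\mu_t$ will enforce the control constraint. Specifically for each $b^{t-1}$ we have $\mu_t[b^{t-1}] \in \mathcal{U}\big(\gamma_t[\mu_{t-1}[b^{t-2}], b_{t-1}]\big)$. □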

Lemma 7.4 For every policy $\{\mu_t\}$ satisfying (28) with resulting joint measure $R^\mu(du^T, d\pi^T, da^T, db^T)$ there exists a channel input distribution $\{q(da_t \mid \pi_t, b^{t-1})\}$ with resulting joint measure $Q(ds^T, da^T, db^T)$ such that for each $t$: $Q(d\pi_t, da_t, db^t) = R^\mu(d\pi_t, da_t, db^t)$.

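Proof: For each $t$ choose a channel input distribution that satisfies:
$$q(da_t \mid \pi_t, b^{t-1}) = R^\mu(da_t \mid \pi_t, b^{t-1})$$
for $R^\mu$ almost all $\pi_t, b^{t-1}$. We proceed by induction. For $t = 1$ we have $Q(d\pi_1, da_1, db_1) = r(db_1 \mid \pi_1, a_1) \otimes q(da_1 \mid \pi_1) \otimes \delta_{\{p(ds_1)\}}(d\pi_1) = R^\mu(d\pi_1, da_1, db_1)$. For $t + 1$ we have for any Borel measurable $\Omega \subset \mathcal{P}(\mathcal{S})$ and all $a_{t+1}, b^{t+1}$:
$$\begin{aligned}
Q(\Omega, a_{t+1}, b^{t+1})
&= \int_{\mathcal{A}} \int_{\mathcal{P}(\mathcal{S})} \int_{\Omega} r(b_{t+1} \mid \pi_{t+1}, a_{t+1})\, q(a_{t+1} \mid \pi_{t+1}, b^t)\, \delta_{\{\Phi_\Pi(\pi_t, a_t, b_t)\}}(d\pi_{t+1})\, Q(d\pi_t, da_t, b^t) \\
&\overset{(a)}{=} \int_{\mathcal{A}} \int_{\mathcal{P}(\mathcal{S})} \int_{\Omega} r(b_{t+1} \mid \pi_{t+1}, a_{t+1})\, q(a_{t+1} \mid \pi_{t+1}, b^t)\, \delta_{\{\Phi_\Pi(\pi_t, a_t, b_t)\}}(d\pi_{t+1})\, R^\mu(d\pi_t, da_t, b^t) \\
&= \int_{\mathcal{U}} \int_{\mathcal{A}} \int_{\mathcal{P}(\mathcal{S})} \int_{\Omega} r(b_{t+1} \mid \pi_{t+1}, a_{t+1})\, q(a_{t+1} \mid \pi_{t+1}, b^t)\, \delta_{\{\Phi_\Pi(\pi_t, a_t, b_t)\}}(d\pi_{t+1}) \\
&\qquad \times\; r(b_t \mid \pi_t, a_t)\, u_t(d\pi_t, da_t)\, R^\mu(du_t, b^{t-1}) \\
&\overset{(b)}{=} \int_{\Omega} r(b_{t+1} \mid \pi_{t+1}, a_{t+1})\, q(a_{t+1} \mid \pi_{t+1}, b^t) \\
&\qquad \times \Big( \int_{\mathcal{U}} \int_{\mathcal{A}} \int_{\mathcal{P}(\mathcal{S})} \delta_{\{\Phi_\Pi(\pi_t, a_t, b_t)\}}(d\pi_{t+1})\, r(d\pi_t, da_t \mid u_t, b_t)\, r(b_t \mid u_t)\, R^\mu(du_t, b^{t-1}) \Big) \\
&\overset{(c)}{=} \int_{\Omega} r(b_{t+1} \mid \pi_{t+1}, a_{t+1})\, q(a_{t+1} \mid \pi_{t+1}, b^t) \int_{\mathcal{U}} \gamma_{t+1}[u_t, b_t](d\pi_{t+1})\, R^\mu(du_t, b^t) \\
&\overset{(d)}{=} \int_{\Omega} r(b_{t+1} \mid \pi_{t+1}, a_{t+1})\, q(a_{t+1} \mid \pi_{t+1}, b^t)\, R^\mu(d\pi_{t+1}, b^t) \\
&\overset{(e)}{=} R^\mu(\Omega, a_{t+1}, b^{t+1})
\end{aligned}$$
where (a) follows from the induction hypothesis, (b) follows from lemma 7.1, (c) follows from equation (25), (d) follows from lemma 7.2, and (e) follows from the choice of channel input distribution. □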

Lemma 7.5 For every policy $\{\mu_t\}$ satisfying (28) with resulting joint measure $R^\mu$ there exists a deterministic policy $\{\bar{\mu}_t\}$ satisfying (28) with resulting joint measure $R^{\bar{\mu}}$ such that for each $t$: $E_{R^\mu}[c(U_t)] \le E_{R^{\bar{\mu}}}[c(U_t)]$.

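Proof: Fix $\{\mu_t\}$. By lemma 7.4 we know there is a channel input distribution $\{q(da_t \mid \pi_t, b^{t-1})\}$ such that for each $t$: $Q(d\pi_t, da_t, db^t) = R^\mu(d\pi_t, da_t, db^t)$. By lemma 7.3 we know there is a deterministic policy $\{\bar{\mu}_t\}$ such that for each $t$: $R^{\bar{\mu}}(d\pi_t, da_t, db^t) = Q(d\pi_t, da_t, db^t)$. Hence, for this $\{\bar{\mu}_t\}$, we have $R^{\bar{\mu}}(d\pi_t, da_t, db^t) = R^\mu(d\pi_t, da_t, db^t)$.

First note that for each $t$, any Borel measurable $\Omega \subset \mathcal{P}(\mathcal{S})$, and all $a_t, b^{t-1}$:
$$\begin{aligned}
\bar{\mu}_t[b^{t-1}](\Omega, a_t)\, R^\mu(b^{t-1})
&= \bar{\mu}_t[b^{t-1}](\Omega, a_t)\, R^{\bar{\mu}}(b^{t-1}) \\
&= R^{\bar{\mu}}(\Omega, a_t, b^{t-1}) \\
&= R^{\mu}(\Omega, a_t, b^{t-1}) \\
&= \int_{\mathcal{U}} u_t(\Omega, a_t)\, R^\mu(du_t, b^{t-1}).
\end{aligned}$$
This implies $\bar{\mu}_t[b^{t-1}](d\pi_t, da_t) = \int_{\mathcal{U}} u_t(d\pi_t, da_t)\, R^\mu(du_t \mid b^{t-1})$ for $R^\mu$ almost all $b^{t-1}$.

Now for each $t$:
$$\begin{aligned}
E_{R^\mu}[c(U_t)]
&= \int_{\mathcal{U} \times \mathcal{B}^{t-1}} R^\mu(du_t, db^{t-1}) \int\!\!\int\!\!\int r(db_t \mid \pi_t, a_t)\, u_t(d\pi_t, da_t) \log \frac{r(b_t \mid \pi_t, a_t)}{\int\!\!\int r(b_t \mid \tilde{\pi}_t, \tilde{a}_t)\, u_t(d\tilde{\pi}_t, d\tilde{a}_t)} \\
&\overset{(a)}{\le} \int R^\mu(db^{t-1}) \int\!\!\int\!\!\int r(db_t \mid \pi_t, a_t) \Big( \int_{\mathcal{U}} u_t(d\pi_t, da_t)\, R^\mu(du_t \mid b^{t-1}) \Big) \\
&\qquad \times \log \frac{r(b_t \mid \pi_t, a_t)}{\int\!\!\int r(b_t \mid \tilde{\pi}_t, \tilde{a}_t) \big( \int_{\mathcal{U}} u_t(d\tilde{\pi}_t, d\tilde{a}_t)\, R^\mu(du_t \mid b^{t-1}) \big)} \\
&\overset{(b)}{=} \int R^\mu(db^{t-1}) \int\!\!\int\!\!\int r(db_t \mid \pi_t, a_t)\, \bar{\mu}_t[b^{t-1}](d\pi_t, da_t) \log \frac{r(b_t \mid \pi_t, a_t)}{\int\!\!\int r(b_t \mid \tilde{\pi}_t, \tilde{a}_t)\, \bar{\mu}_t[b^{t-1}](d\tilde{\pi}_t, d\tilde{a}_t)} \\
&\overset{(c)}{=} \int R^{\bar{\mu}}(du_t, db^{t-1}) \int\!\!\int\!\!\int r(db_t \mid \pi_t, a_t)\, u_t(d\pi_t, da_t) \log \frac{r(b_t \mid \pi_t, a_t)}{\int\!\!\int r(b_t \mid \tilde{\pi}_t, \tilde{a}_t)\, u_t(d\tilde{\pi}_t, d\tilde{a}_t)} \\
&= E_{R^{\bar{\mu}}}[c(U_t)]
\end{aligned}$$
where (a) follows from the conditional Jensen's inequality; (b) follows from above; and (c) follows because $R^\mu(db^{t-1}) = R^{\bar{\mu}}(db^{t-1})$ and $\bar{\mu}_t$ is a deterministic policy. □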

Thus without loss of generality the policies in the POMDP described in equation (31) 
can be restricted to be deterministic policies. 

Theorem 7.1 The two optimization problems given by (18) and (31) have the same optimal 
cost. 

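Proof: First note that for any deterministic policy $\{\mu_t\}$ satisfying (28) with resulting joint measure $R^\mu$ and, as given in lemma 7.4, an associated channel input distribution $\{q(da_t \mid \pi_t, b^{t-1})\}$ with associated joint measure $Q$ the following holds for each $t$:
$$\begin{aligned}
E_{R^\mu}[c(U_t)]
&= \int_{\mathcal{U} \times \mathcal{B}^{t-1}} R^\mu(du_t, db^{t-1}) \int\!\!\int\!\!\int r(db_t \mid \pi_t, a_t)\, u_t(d\pi_t, da_t) \log \frac{r(b_t \mid \pi_t, a_t)}{\int\!\!\int r(b_t \mid \tilde{\pi}_t, \tilde{a}_t)\, u_t(d\tilde{\pi}_t, d\tilde{a}_t)} \\
&\overset{(a)}{=} \int R^\mu(db^{t-1}) \int\!\!\int\!\!\int r(db_t \mid \pi_t, a_t)\, \mu_t[b^{t-1}](d\pi_t, da_t) \log \frac{r(b_t \mid \pi_t, a_t)}{\int\!\!\int r(b_t \mid \tilde{\pi}_t, \tilde{a}_t)\, \mu_t[b^{t-1}](d\tilde{\pi}_t, d\tilde{a}_t)} \\
&\overset{(b)}{=} \int R^\mu(db^{t-1}) \int\!\!\int\!\!\int r(db_t \mid \pi_t, a_t)\, R^\mu(d\pi_t, da_t \mid b^{t-1}) \log \frac{r(b_t \mid \pi_t, a_t)}{\int\!\!\int r(b_t \mid \tilde{\pi}_t, \tilde{a}_t)\, R^\mu(d\tilde{\pi}_t, d\tilde{a}_t \mid b^{t-1})} \\
&\overset{(c)}{=} \int Q(db^{t-1}) \int\!\!\int\!\!\int r(db_t \mid \pi_t, a_t)\, Q(d\pi_t, da_t \mid b^{t-1}) \log \frac{r(b_t \mid \pi_t, a_t)}{\int\!\!\int r(b_t \mid \tilde{\pi}_t, \tilde{a}_t)\, Q(d\tilde{\pi}_t, d\tilde{a}_t \mid b^{t-1})} \\
&= I_Q(A_t, \Pi_t; B_t \mid B^{t-1}) \\
&\overset{(d)}{=} I_{R^\mu}(A_t, \Pi_t; B_t \mid B^{t-1})
\end{aligned}$$
where (a) and (b) follow because $\mu_t$ is a deterministic policy and hence $R^\mu(d\pi_t, da_t \mid b^{t-1}) = \mu_t[b^{t-1}](d\pi_t, da_t)$. Lines (c) and (d) follow because $Q(d\pi_t, da_t, db^t) = R^\mu(d\pi_t, da_t, db^t)$.

The theorem then follows from this observation and lemmas 7.3-7.5. □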



7.4 Fully Observed Markov Decision Problem 

In this section we make one final simplification. We will convert the POMDP into a fully 
observed MDP on a suitably defined state space. 

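Note that the cost, $c(u)$ given in equation (30), at time $t$ only depends on $u_t$. The control constraints, $\mathcal{U}(\gamma)$ given in equation (26), at time $t$ only depend on $\gamma_t$. The statistics $\gamma_t$ only depend on $p(ds_1)$ in the case $t = 1$ and only depend on $(u_{t-1}, b_{t-1})$ in the case $t > 1$.

This suggests that $\Gamma_t \in \mathcal{P}(\mathcal{P}(\mathcal{S}))$ could be a suitable fully observed state. The dynamics are given as: $\gamma_1(d\pi_1) = \delta_{\{p(ds_1)\}}(d\pi_1)$ and for $t > 1$:
$$r(d\gamma_{t+1} \mid \gamma_t, u_t) = \int_{\mathcal{P}(\mathcal{S})} \int_{\mathcal{A}} \int_{\mathcal{B}} \delta_{\{\Phi_\Gamma(u_t, b_t)\}}(d\gamma_{t+1})\, r(db_t \mid \pi_t, a_t)\, u_t(d\pi_t, da_t) \tag{33}$$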

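Lemma 7.6 For every policy $\{\mu_t\}$ satisfying (28) with resulting joint measure $R^\mu$ we have for each $t > 1$:
$$R^\mu(d\gamma_t \mid \gamma_{t-1}, u_{t-1}) = r(d\gamma_t \mid \gamma_{t-1}, u_{t-1}) \tag{34}$$
for $R^\mu$ almost all $\gamma_{t-1}, u_{t-1}$.

Proof: For each $t > 1$ and for any Borel measurable sets $\Omega_{t-1}, \Omega_t \subset \mathcal{P}(\mathcal{P}(\mathcal{S}))$ and any Borel measurable set $\Theta \subset \mathcal{U}$ we have:
$$\begin{aligned}
R^\mu(\Omega_t, \Omega_{t-1}, \Theta)
&= \int_{\Theta} \int_{\Omega_{t-1}} \int_{\mathcal{B}} \int_{\mathcal{A}} \int_{\mathcal{P}(\mathcal{S})} R^\mu(\Omega_t, d\gamma_{t-1}, du_{t-1}, d\pi_{t-1}, da_{t-1}, db_{t-1}) \\
&= \int_{\Theta} \int_{\Omega_{t-1}} \int_{\mathcal{B}} \int_{\mathcal{A}} \int_{\mathcal{P}(\mathcal{S})} \{\Phi_\Gamma(u_{t-1}, b_{t-1}) \in \Omega_t\}\, r(db_{t-1} \mid \pi_{t-1}, a_{t-1})\, u_{t-1}(d\pi_{t-1}, da_{t-1})\, R^\mu(d\gamma_{t-1}, du_{t-1}) \\
&= \int_{\Theta} \int_{\Omega_{t-1}} r(\Omega_t \mid \gamma_{t-1}, u_{t-1})\, R^\mu(d\gamma_{t-1}, du_{t-1})
\end{aligned}$$
where the last line follows from equation (33). □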

Note that the dynamics given in equation (33), $r(d\gamma_{t+1} \mid \gamma_t, u_t)$, depend only on $u_t$. This along with the fact that the cost at time $t$ only depends on $u_t$ and the control constraint at time $t$ only depends on $\gamma_t$ suggests that we can simplify the form of the control policy from $\mu_t(du_t \mid u^{t-1}, b^{t-1})$ to $\mu_t(du_t \mid \gamma_t)$.

Theorem 7.2 Without loss of generality, the optimization given in equation (31) can be modelled as a fully observed MDP with

(1) State space $\mathcal{P}(\mathcal{P}(\mathcal{S}))$ and dynamics given by (33)

(2) Compact control constraints $\mathcal{U}(\gamma)$ given by (26)

(3) Running cost $c(u)$ given by (30).

Proof: See section 10.2 of Bertsekas and Shreve, in particular proposition 10.5. □

Theorem 7.1 shows that for any deterministic policy $\{\mu_t[b^{t-1}]\}$ with resulting joint measure $R^\mu$ there is a corresponding channel input distribution $q(da_t \mid \pi_t, b^{t-1})$ with resulting joint measure $Q$ such that for all $t$: $Q(d\pi_t, da_t, db^t) = R^\mu(d\pi_t, da_t, db^t)$. Hence $Q(da_t \mid \pi_t, b^{t-1}) = R^\mu(da_t \mid \pi_t, b^{t-1})$ for $R^\mu$ almost all $\pi_t, b^{t-1}$. Theorem 7.1 also shows $c(\mu_t[b^{t-1}]) = I_Q(A_t, \Pi_t; B_t \mid b^{t-1})$ for $R^\mu$ almost all $b^{t-1}$.

By theorem 7.2 we know we can, without loss of generality, restrict ourselves to deterministic policies of the form $\{\mu_t[\gamma_t]\}$. Under such a policy we have:
$$Q(da_t \mid \pi_t, b^{t-1}) = R^\mu(da_t \mid \pi_t, b^{t-1}) = R^\mu(da_t \mid \pi_t, \gamma_t)$$
for $R^\mu$ almost all $\pi_t, b^{t-1}, \gamma_t$. For a fixed deterministic policy $\gamma_t$ is a function of $b^{t-1}$. Thus the optimal channel input distribution takes the form $\{q(da_t \mid \pi_t, \gamma_t)\}$ and
$$c(\mu_t[\gamma_t]) = I_Q(A_t, \Pi_t; B_t \mid \gamma_t) \quad R^\mu\text{-almost all } \gamma_t. \tag{35}$$

Recall that in equation (18) we started with terms of the form $I(A^t; B_t \mid B^{t-1})$ and have now simplified it to terms of the form $I(A_t, \Pi_t; B_t \mid \Gamma_t)$.

7.5 ACOE and Information Stability

In this section we present the ACOE for the fully observed MDP corresponding to the equivalent optimizations in (18) and (31). We then show that the process is information stable under the optimal input distribution. Finally we relate the equivalent optimizations in (18) and (31) to the optimization given in (4): $\sup_{\{\mathcal{D}_T\}_{T=1}^{\infty}} \underline{I}(A \to B)$.

The following technical lemma is required to ensure the existence of a measurable selector in the ACOE given in (36) below. The proof is straightforward but tedious and can be found in the appendix.

Lemma 7.7 For $|\mathcal{B}|$ finite we have

(1) The cost is bounded and continuous. Specifically, $0 \le c(u) \le \log |\mathcal{B}|$, $\forall u \in \mathcal{U}$.

(2) The control constraint function $\mathcal{U}(\gamma)$ is a continuous set-valued map between $\mathcal{P}(\mathcal{P}(\mathcal{S}))$ and $\mathcal{U}$.

(3) The dynamics $r(d\gamma_{t+1} \mid \gamma_t, u_t)$ are continuous.

We now present the infinite horizon average cost verification theorem.

Theorem 7.3 If there exists a $V^* \in \mathbb{R}$, a bounded function $w : \gamma \mapsto w(\gamma) \in \mathbb{R}$, and a policy $\mu^*$ achieving the supremum for each $\gamma$ in the following average cost optimality equation (ACOE):
$$V^* + w(\gamma) = \sup_{u \in \mathcal{U}(\gamma)} \Big\{ c(u) + \int w(\tilde{\gamma})\, r(d\tilde{\gamma} \mid \gamma, u) \Big\} \tag{36}$$
then

(1) $V^*$ is the optimal value of the optimization in (31). The optimal policy is the stationary, deterministic policy given by $\mu^*$.

(2) Under this $\mu^*$ we have
$$V^* = \liminf_{T \to \infty} \frac{1}{T} E_{R^{\mu^*}}\Big[\sum_{t=1}^{T} c(U_t)\Big] = \limsup_{T \to \infty} \frac{1}{T} E_{R^{\mu^*}}\Big[\sum_{t=1}^{T} c(U_t)\Big]$$
and
$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} c(U_t) = V^* \quad R^{\mu^*}\text{-a.s.}$$


Proof: Follows from lemma 7.7 and theorems 6.2 and 6.3 of PQ. □ 

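For a measure $Q(d\pi_t, da_t, db^t)$ define
$$i_Q(a_t, \pi_t; b_t \mid b^{t-1}) = \log \frac{r(b_t \mid \pi_t, a_t)}{\int r(b_t \mid \tilde{\pi}_t, \tilde{a}_t)\, Q(d\tilde{\pi}_t, d\tilde{a}_t \mid b^{t-1})} \quad Q\text{-almost all } a_t, \pi_t, b^t. \tag{37}$$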



The following theorem will allow us to view the ACOE, equation (36), as an implicit single- 
letter characterization of the capacity of the Markov channel. 

Theorem 7.4 Assume there exists a $V^* \in \mathbb{R}$, a bounded function $w : \gamma \mapsto w(\gamma) \in \mathbb{R}$, and a policy $\mu^*$ achieving the supremum for each $\gamma$ in ACOE (36). For $\mu^*$ and resulting joint measure $R^{\mu^*}$ let $\{q^*(da_t \mid \pi_t, b^{t-1})\}$ be the corresponding optimal channel input distribution and $Q^*$ be the corresponding measure.

(1) $\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} i_{Q^*}(A_t, \Pi_t; B_t \mid B^{t-1}) = V^*$ $Q^*$-a.s.

(2) The channel is directed information stable and has a strong converse under the optimal channel input distribution $\{q^*(da_t \mid \pi_t, b^{t-1})\}$.

(3) $V^* = C = \sup_{\{\mathcal{D}_T\}_{T=1}^{\infty}} \underline{I}(A \to B)$ is the capacity of the channel.

Proof: We first prove part (2) and (3) assuming part (1) is true. Part (2) follows from part 
(1) and proposition 5.1. To prove part (3) note: 

C = sup Lq{A^B) 

{^t}??=i 
sup lim inf ^Iq {A^ B^) 

1 

sup lim inf - V ERc{Ut) 
1 ^ 

sup liminf — V'^/jc(C/j) 

= y* 

where (a) follows from lemma 4.1; (b) follows from theorems 7.1 and 7.2; and (c) follows 
from Bellman's principle of optimality. Note the supremizations in (b) and (c) are over 



(a) 
< 



(b) 
policies that satisfy the control constraint (28). Now by part (1) we see that (a) holds with
equality. Hence part (3) follows.

We need only prove part (1). Note that theorem 7.3(2) implies: 



1 ^ 

V* = lim -yc([/t) R''* -a.s. 



(a) 



t=l 
T 



Y\m. ^^lQ,{At,IluBt j 6*"^) R''* - almost ah 6° 



i=l 
T 



lim lQ,{At,Iit;Bt j 6*"^) Q* - almost all 6° 



t=l 



where (a) follows from (35) and (b) follows because for each t: Q*{db^) = R^ {db*). Hence 
Q*{db°°) = Rf'^db'^). 

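Define the nested family of sigma-fields: $\mathcal{F}_t = \sigma(\Pi^t, A^t, B^t)$. Let
$$Z_t(\pi_t, a_t, b^t) = i_{Q^*}(a_t, \pi_t; b_t \mid b^{t-1}) - I_{Q^*}(A_t, \Pi_t; B_t \mid b^{t-1}).$$
Clearly $Z_t$ is $\mathcal{F}_t$-measurable and $E_{Q^*}(Z_t \mid \mathcal{F}_{t-1}) = 0$ $Q^*$-a.s. Hence $Z_t$ is a martingale difference sequence. The martingale stability theorem states if
$$\sum_{t=1}^{\infty} \frac{E_{Q^*}[Z_t^2 \mid \mathcal{F}_{t-1}]}{t^2} < \infty \quad Q^*\text{-a.s.} \tag{38}$$
then $\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} Z_t = 0$ $Q^*$-a.s. This in turn would imply
$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} i_{Q^*}(a_t, \pi_t; b_t \mid b^{t-1}) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} I_{Q^*}(A_t, \Pi_t; B_t \mid b^{t-1})$$
for $Q^*$-almost all $\pi^\infty, a^\infty, b^\infty$.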

To show that (38) holds note that for any t and Q*-alm.ost all we have: 



Eq^Z^ I b^~' 
< Eg^ii'Q.iAuUuBt I S 

= En* 



t-U I ^t-li 



\og\iBt I Hj, At) +log2 r(St I 7ft,dt)Q*idfruddt \ S*"^)) 
-21ogr(St I Ht, At)log (^j r{Bt \ Tft,dt)Q*{d^t,d~at \ B^-^)^ \ 

< Eq, [\og\{Bt I Tlt,At) I b'-^]+EQ, log2 r(Bt | ^trat)Q*{d^uddt \ B 

< 2\B\ 

The last inequality follows because the function $x \log^2 x$ achieves a maximum value of 1 over the domain $0 \le x \le 1$. Since $1/t^2$ is summable we see that (38) holds. □

There remains the question of when a solution to the ACOE exists. There exist many
sufficient conditions for the existence of a solution. See for a representative sample. 

Most of these conditions require the process be recurrent under the optimal policy. The 
following theorem describes one such sufficient condition: 

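Theorem 7.5 If there exists an $\alpha < 1$ such that
$$\sup_{\gamma_t, \tilde{\gamma}_t;\; u_t \in \mathcal{U}(\gamma_t),\; \tilde{u}_t \in \mathcal{U}(\tilde{\gamma}_t)} \big\| r(d\gamma_{t+1} \mid \gamma_t, u_t) - r(d\gamma_{t+1} \mid \tilde{\gamma}_t, \tilde{u}_t) \big\|_{TV} \le \alpha \tag{39}$$
then the ACOE (36) has a bounded solution. Here $\| \cdot \|_{TV}$ denotes the total variation norm.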

Proof: See corollary 6.1 of 
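For intuition, condition (39) can be checked numerically on a finite surrogate of the MDP. The sketch below is only illustrative: it assumes a finite set of state-action pairs, each with a probability vector over next states, and it normalises the total variation distance to lie in $[0, 1]$ (half the $\ell_1$ distance), which is an assumption about the convention behind (39). The names `tv_distance` and `contraction_coefficient` are hypothetical.

```python
def tv_distance(p, q):
    """Total variation distance, normalised to [0, 1] (half the L1 distance)."""
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))

def contraction_coefficient(kernel):
    """Largest TV distance between any two rows of a finite transition kernel.

    kernel[(state, action)] is a probability vector over next states.
    """
    rows = list(kernel.values())
    return max(tv_distance(p, q) for p in rows for q in rows)

# Illustrative two-state, two-action kernel.
kernel = {
    (0, 0): [0.9, 0.1],
    (0, 1): [0.6, 0.4],
    (1, 0): [0.5, 0.5],
    (1, 1): [0.2, 0.8],
}
alpha = contraction_coefficient(kernel)
```

In this example every pair of rows overlaps, so the coefficient is strictly below 1 and the finite analogue of (39) holds. A kernel with two rows of disjoint support would give a coefficient of exactly 1 and fail the condition.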

Condition (39) ensures that for any stationary policy there exists a stationary distribution. Specifically:

Proposition 7.1 If (39) holds then for all stationary policies of the form $\mu : \gamma \mapsto u(d\pi, da)$, there exists a probability measure $\lambda_\mu$ on $\mathcal{P}(\mathcal{S})$ such that for any $\epsilon > 0$ there exists a $T$ large enough such that $\forall t > T$:
$$\big\| r_\mu^t(d\gamma_t \mid \gamma_1) - \lambda_\mu(d\gamma_t) \big\|_{TV} \le \epsilon \tag{40}$$
where $r_\mu^t(d\gamma_t \mid \gamma_1)$ is the $t$-step transition stochastic kernel under the stationary policy $\mu$. Furthermore $\lim_{T \to \infty} \frac{1}{T} E_{R^\mu} \sum_{t=1}^{T} c(\mu(\gamma_t)) = \int c(\mu(\gamma))\, \lambda_\mu(d\gamma)$ independent of the choice of $p(ds_1)$.

Proof: See lemma 3.3 of J7j- ^ 

We have until this point assumed that the Markov channel parameters are fixed. The 
last part of proposition 7.1 shows that the capacity C is the same no matter which choice 
of p{dsi) is chosen. 

In the case that one chooses a policy without feedback, equation (40) essentially reduces to the definition of indecomposability found in Gallager [14], equation 4.6.26.

Finding conditions that imply (39) or (40) directly in terms of the Markov channel, $\{p(ds_1), p(ds_{t+1} \mid s_t, a_t, b_t), p(db_t \mid s_t, a_t)\}$, is challenging. This is essentially the problem of
determining conditions for the ergodicity of the underlying hidden Markov model under the 
optimal stationary policy. See JH]) [2]i |23; ^0] for some representative conditions. 



8 Cases with Simple Sufficient Statistics 

As we have already seen, the sufficient statistics $\Pi_t \in \mathcal{P}(\mathcal{S})$ and $\Gamma_t \in \mathcal{P}(\mathcal{P}(\mathcal{S}))$ can be quite complicated in general. There are, though, many situations where they become much simpler.



8.1 S Computable from the Channel Input and Output 

Recall that $\Pi_t$ is a function of $(A^{t-1}, B^{t-1})$ and satisfies the recursion: $\pi_{t+1} = \Phi_\Pi(\pi_t, a_t, b_t)$. In many scenarios the state $S_t$ is computable from $(A^{t-1}, B^{t-1})$. Here we assume that $p(ds_1) = \delta_{\{s_1\}}(ds_1)$ for some fixed state $s_1$ and for $t > 1$: $\Pi_t(ds_t) = \delta_{\{S_t\}}(ds_t)$ $Q$-a.s. This in turn implies there exists a function $\Phi_S$ such that $s_{t+1} = \Phi_S(s_t, a_t, b_t)$. To see this recall equation (13). Because $\Pi_t, \Pi_{t+1}$ are Diracs $Q$-almost surely it must be the case that $S_{t+1}$ is a function of $S_t, A_t, B_t$ $Q$-a.s. One example of such a channel would be: $\{p(db_t \mid a_t, a_{t-1}, b_{t-1})\}$. Here one could choose the state: $S_t = (A_{t-1}, B_{t-1})$.

We can directly associate $\Pi$ with $S$. In addition $\Gamma$ can be viewed as a conditional probability of the state $S$. Hence we can restrict ourselves to control policies of the form: $\mu : \mathcal{P}(\mathcal{S}) \to \mathcal{U} = \mathcal{P}(\mathcal{S} \times \mathcal{A})$ taking $\gamma \mapsto u(ds, da)$. Now the control constraints take the form $\mathcal{U}(\gamma) = \{u(ds, da) : u(ds, da) \in \mathcal{U},\; u(ds) = \gamma(ds)\}$. The dynamics of $\Gamma_t$ given in equations (24), (25) simplify to: $\gamma_1(ds_1) = \delta_{\{s_1\}}(ds_1)$ and for $t > 1$ and all $s_t$:
$$\gamma_t[u^{t-1}, b^{t-1}](s_t) = \sum_{s_{t-1}, a_{t-1}} \delta_{\{\Phi_S(s_{t-1}, a_{t-1}, b_{t-1})\}}(s_t)\, r(s_{t-1}, a_{t-1} \mid u_{t-1}, b_{t-1}) \tag{41}$$

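Hence equation (33) simplifies to:
$$r(d\gamma_{t+1} \mid \gamma_t, u_t) = \sum_{s_t, a_t, b_t} \delta_{\{\Phi_\Gamma(u_t, b_t)\}}(d\gamma_{t+1})\, p(b_t \mid s_t, a_t)\, u_t(s_t, a_t) \tag{42}$$
where $\Phi_\Gamma(u, b)$ comes from (41). The cost in equation (30) simplifies as well:
$$c(u) = \sum_{s, a, b} p(b \mid s, a)\, u(s, a) \log \frac{p(b \mid s, a)}{\sum_{\tilde{s}, \tilde{a}} p(b \mid \tilde{s}, \tilde{a})\, u(\tilde{s}, \tilde{a})} \tag{43}$$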
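The finite-state forms (41) and (43) are simple enough to compute directly. The sketch below is a minimal illustration under stated assumptions: all alphabets are finite, `u` maps `(s, a)` to a probability, `p_out[(s, a)]` is the output distribution $p(\cdot \mid s, a)$, and the kernel $r(s, a \mid u, b)$ of lemma 7.1 is obtained by Bayes' rule from $p(b \mid s, a)\,u(s, a)$. The names `next_gamma`, `cost`, and `phi_s` are hypothetical; the cost is in nats.

```python
import math

def next_gamma(u, b, p_out, phi_s):
    """gamma_{t+1} as in (41): push r(s, a | u, b) through s' = phi_s(s, a, b)."""
    joint = {(s, a): w * p_out[(s, a)][b] for (s, a), w in u.items()}
    norm = sum(joint.values())
    if norm == 0:
        raise ValueError("output b has zero probability under u")
    gamma = {}
    for (s, a), w in joint.items():
        if w > 0:
            nxt = phi_s(s, a, b)
            gamma[nxt] = gamma.get(nxt, 0.0) + w / norm
    return gamma

def cost(u, p_out, n_b):
    """c(u) as in (43), in nats."""
    mix = [sum(w * p_out[sa][b] for sa, w in u.items()) for b in range(n_b)]
    return sum(w * p_out[sa][b] * math.log(p_out[sa][b] / mix[b])
               for sa, w in u.items() for b in range(n_b)
               if w > 0 and p_out[sa][b] > 0)

# Illustrative channel: the state is the previous input and the output is noiseless.
p_out = {(s, a): ([1.0, 0.0] if a == 0 else [0.0, 1.0])
         for s in (0, 1) for a in (0, 1)}
u = {(0, 0): 0.5, (0, 1): 0.5}
gamma_next = next_gamma(u, 1, p_out, lambda s, a, b: a)
cost_value = cost(u, p_out, 2)
```

In this example observing $b = 1$ reveals that the input was 1, so the updated statistic is a point mass on state 1, and the cost equals $\log 2$ nats.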

In addition 

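$$I(A_t, \Pi_t; B_t \mid \Gamma_t) = I(A_t; B_t \mid S_t, \Gamma_t) + I(S_t; B_t \mid \Gamma_t) \tag{44}$$

Finally the ACOE equation (36) in theorem 7.3 simplifies to an equation where $w(\gamma)$ is now a function over $\mathcal{P}(\mathcal{S})$:
$$V^* + w(\gamma) = \sup_{u \in \mathcal{U}(\gamma)} \Big\{ c(u) + \int w(\tilde{\gamma})\, r(d\tilde{\gamma} \mid \gamma, u) \Big\} \tag{45}$$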

The sufficient condition, equation (39), given in theorem 7.5 continues to hold with dynamics 
given by (42). 

We now examine two cases where the computations simplify further: S is either com- 
putable from the channel output or channel input only. 



8.1.1 Case 1: S Computable from the Channel Input Only 

Here we assume $S_t$ is computable from only $A^{t-1}$ and hence $S$ is known to the transmitter. Hence $\Pi_t$ is a function of $A^{t-1}$ and satisfies the recursion: $\pi_{t+1} = \Phi_\Pi(\pi_t, a_t)$. This in turn implies there exists a function $\Phi_S$ such that $s_{t+1} = \Phi_S(s_t, a_t)$. These channels are often called finite state machine Markov channels. Note that any general channel of the form $\{p(db_t \mid a_t, a_{t-\Delta}^{t-1})\}$, for a finite $\Delta$, can be converted into a Markov channel with state $S_t = A_{t-\Delta}^{t-1}$, computable from the channel input.

As before we can directly associate $\Pi$ with $S$ and $\Gamma$ can be viewed as a conditional probability of the state $S$. Equations (41)-(45) continue to hold with obvious modifications.
See [HHl; [nnj for more details. For Gaussian finite state machine Markov channels the 
estimate Ft can be easily computed by using a Kalman filter j4Uj . |41j . 
8.1.2 Case 2: S Computable from the Channel Output Only

Here we assume $S_t$ is computable from only $B^{t-1}$. Thus $S$ is known to the receiver and, via feedback, is known to the transmitter. Then $\Pi_t$ is a function of $B^{t-1}$ and satisfies the recursion: $\pi_{t+1} = \Phi_\Pi(\pi_t, b_t)$. This in turn implies there exists a function $\Phi_S$ such that $s_{t+1} = \Phi_S(s_t, b_t)$. Note that any general channel of the form $\{p(db_t \mid a_t, b_{t-\Delta}^{t-1})\}$, for a finite $\Delta$, can be converted into a Markov channel with state $S_t = B_{t-\Delta}^{t-1}$, computable from the channel output.

As before we can directly associate $\Pi$ with $S$. In addition, because $\Pi$ is computable from the channel outputs we can directly associate $\Gamma$ with $\Pi$ and hence with $S$. We can then restrict ourselves to control policies of the form: $\mu : \mathcal{S} \to \mathcal{U} = \mathcal{P}(\mathcal{A})$ taking $\gamma \mapsto u(da)$. To see this note that the control constraints become trivial and hence we can use control actions of the form $u(da)$ as opposed to $u(ds, da)$. In this case the dynamics in (33) simplify quite a bit: $\gamma_1 = s_1$ and for $t > 1$:
$$r(d\gamma_{t+1} \mid \gamma_t, u_t) = \sum_{a_t, b_t} \delta_{\{\Phi_S(\gamma_t, b_t)\}}(d\gamma_{t+1})\, p(b_t \mid \gamma_t, a_t)\, u_t(a_t) \tag{46}$$

The cost in equation (30) simplifies as well:
$$c(s, u) = \sum_{a, b} p(b \mid s, a)\, u(a) \log \frac{p(b \mid s, a)}{\sum_{\tilde{a}} p(b \mid s, \tilde{a})\, u(\tilde{a})}$$

In addition $I(A_t, \Pi_t; B_t \mid \Gamma_t) = I(A_t, S_t; B_t \mid S_t) = I(A_t; B_t \mid S_t)$. Finally the ACOE equation (36) in theorem 7.3 simplifies to an equation where $w(\gamma)$ is now a function over $\mathcal{S}$:
$$V^* + w(\gamma) = \sup_{u \in \mathcal{U}} \Big\{ c(\gamma, u) + \int w(\tilde{\gamma})\, r(d\tilde{\gamma} \mid \gamma, u) \Big\} \tag{47}$$

Markov Channels with State Observable to the Receiver: An important scenario that falls under the case just described is that of a Markov channel, $p(ds_1)$, $\{p(ds_{t+1} \mid s_t, a_t, b_t)\}$, $\{p(db_t \mid s_t, a_t)\}$, with state observable to the receiver. Specifically at time $t$ we assume that along with $B_t$, the state $S_{t+1}$ is observable to the receiver. The standard technique for dealing with this setting is to define a new channel output as follows: $\bar{B}_t = (B_t, S_{t+1})$. The new Markov channel has the same state transition kernel but the channel output is: $p(d\bar{b}_t \mid s_t, a_t) = p(ds_{t+1} \mid s_t, a_t, b_t) \otimes p(db_t \mid s_t, a_t)$. We also assume that $S_1$ is observable to the transmitter. (This can be achieved by assuming that $\bar{B}_0 = S_1$ is transmitted during epoch 0.) Thus the dynamics in (46) can be written as: $\gamma_1 = s_1$ and for $t > 1$:
$$r(\gamma_{t+1} \mid \gamma_t, u_t) = \sum_{a_t, b_t} p(\gamma_{t+1} \mid \gamma_t, a_t, b_t)\, p(b_t \mid \gamma_t, a_t)\, u_t(a_t) \tag{48}$$

Also, $I(A_t, \Pi_t; \bar{B}_t \mid \Gamma_t) = I(A_t; B_t \mid S_t) + I(A_t; S_{t+1} \mid S_t, B_t)$. The second addend is zero if there is no ISI.

The sufficient condition, equation (39), given in theorem 7.5 continues to hold with dynamics given by (48). If there is no ISI then equation (48) reduces to: $r(d\gamma_{t+1} \mid \gamma_t, u_t) = p(d\gamma_{t+1} \mid \gamma_t)$. If $p(ds_{t+1} \mid s_t)$ is an ergodic transition kernel with stationary distribution $\nu$ then there exists a bounded solution to the ACOE. The ACOE reduces to: $V^* = \sum_s \nu(s) \max_u c(s, u)$. Thus we recover the well known formula for the capacity of a non-ISI ergodic Markov channel with state available to both the transmitter and receiver.
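The formula $V^* = \sum_s \nu(s) \max_u c(s, u)$ can be evaluated numerically for small examples. The sketch below is one possible way to do so, not a method taken from this paper: it swaps in the standard Blahut-Arimoto iteration for the per-state maximization and plain power iteration for the stationary distribution. The function names, the table layouts `chan[a][b]` and `P[i][j]`, and the iteration counts are all assumptions. Results are in nats.

```python
import math

def mutual_information(u, chan):
    """c(s, u) for one state: chan[a][b] = p(b | s, a), result in nats."""
    n_b = len(chan[0])
    mix = [sum(u[a] * chan[a][b] for a in range(len(u))) for b in range(n_b)]
    return sum(u[a] * chan[a][b] * math.log(chan[a][b] / mix[b])
               for a in range(len(u)) for b in range(n_b)
               if u[a] > 0 and chan[a][b] > 0)

def blahut_arimoto(chan, iters=500):
    """Approximate max_u c(s, u) by the Blahut-Arimoto iteration."""
    n_a, n_b = len(chan), len(chan[0])
    u = [1.0 / n_a] * n_a
    for _ in range(iters):
        mix = [sum(u[a] * chan[a][b] for a in range(n_a)) for b in range(n_b)]
        # d[a] = exp(KL(p(. | a) || mix)); reweight each input by d[a].
        d = [math.exp(sum(chan[a][b] * math.log(chan[a][b] / mix[b])
                          for b in range(n_b) if chan[a][b] > 0))
             for a in range(n_a)]
        z = sum(u[a] * d[a] for a in range(n_a))
        u = [u[a] * d[a] / z for a in range(n_a)]
    return mutual_information(u, chan)

def stationary(P, iters=1000):
    """Stationary distribution of an ergodic chain by power iteration."""
    n = len(P)
    nu = [1.0 / n] * n
    for _ in range(iters):
        nu = [sum(nu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return nu

def capacity(P, chans):
    """V* = sum_s nu(s) max_u c(s, u) for a non-ISI channel with known state."""
    nu = stationary(P)
    return sum(nu[s] * blahut_arimoto(chans[s]) for s in range(len(P)))

# Illustrative example: a good (noiseless) state and a bad (useless) state.
P = [[0.9, 0.1], [0.1, 0.9]]
chans = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]]]
v_star = capacity(P, chans)
```

In this example the chain spends half its time in each state, the good state contributes $\log 2$ nats and the bad state contributes nothing, so the value is $\tfrac{1}{2} \log 2$.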

8.2 Π Computable from the Channel Output

Here we assume that $\Pi_t$ is a function of only $B^{t-1}$ and thus satisfies the recursion: $\pi_{t+1} = \Phi_\Pi(\pi_t, b_t)$. Hence $\Gamma_t(d\pi_t) = \delta_{\{\Pi_t\}}(d\pi_t)$ $Q$-a.s. We can thus directly associate $\Gamma$ with $\Pi$. Now $\Gamma$ can be viewed as a conditional probability of the state $S$. One can view the associated canonical Markov channel as a Markov channel with state $\Pi$ computable from the channel output only (as discussed in the previous section).

We can then restrict ourselves to control policies of the form: $\mu : \mathcal{P}(\mathcal{S}) \to \mathcal{U} = \mathcal{P}(\mathcal{A})$ taking $\gamma \mapsto u(da)$. To see this note that the control constraints become trivial and hence we can use control actions of the form $u(da)$ as opposed to $u(d\pi, da)$. In this case the dynamics in (33) simplify to: $\gamma_1(d\pi_1) = \delta_{\{p(ds_1)\}}(d\pi_1)$ and for $t > 1$:
$$r(d\gamma_{t+1} \mid \gamma_t, u_t) = \sum_{s_t, a_t, b_t} \delta_{\{\Phi_\Pi(\gamma_t, b_t)\}}(d\gamma_{t+1})\, p(b_t \mid s_t, a_t)\, \gamma_t(s_t)\, u_t(a_t) \tag{49}$$

The cost in equation (30) simplifies as well:
$$c(\pi, u) = \sum_{s, a, b} p(b \mid s, a)\, \pi(s)\, u(a) \log \frac{\sum_{\tilde{s}} p(b \mid \tilde{s}, a)\, \pi(\tilde{s})}{\sum_{\tilde{s}, \tilde{a}} p(b \mid \tilde{s}, \tilde{a})\, \pi(\tilde{s})\, u(\tilde{a})}$$

In addition $I(A_t, \Pi_t; B_t \mid \Gamma_t) = I(A_t; B_t \mid \Pi_t)$. Finally the ACOE equation (36) in theorem 7.3 simplifies to an equation where $w(\gamma)$ is now a function over $\mathcal{P}(\mathcal{S})$:
$$V^* + w(\gamma) = \sup_{u \in \mathcal{U}} \Big\{ c(\gamma, u) + \int w(\tilde{\gamma})\, r(d\tilde{\gamma} \mid \gamma, u) \Big\} \tag{50}$$

In this case the optimal channel input distribution $q(da_t \mid \pi_t, \gamma_t)$ can be written in the form $q(da_t \mid b^{t-1})$. Furthermore the code-function distribution can be taken to be a product distribution. Choose for each $t$ and $f_t$:
$$p(f_t) = \prod_{b^{t-1}} q\big(f_t(b^{t-1}) \mid b^{t-1}\big).$$
Then $P_{F^T}(df^T) = \otimes_{t=1}^{T}\, p(df_t)$. One can easily verify for each $t$ that $P_{F^T}(\Upsilon^t(b^{t-1}, a^t)) = q(a^t \mid b^{t-1})$ and hence is good with respect to $\{q(da_t \mid b^{t-1})\}$.
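The product construction for a single time step can be checked by brute force on small alphabets. In the sketch below a code-function $f_t$ is represented as a tuple with one channel input per feedback history, and `q[h][a]` stands for $q(a \mid b^{t-1} = h)$. The function names and the example numbers are assumptions made for illustration only.

```python
from itertools import product

def code_function_distribution(q, histories, inputs):
    """P(f_t) = product over histories h of q(f_t(h) | h).

    Each code-function f_t is a tuple of inputs, one entry per history.
    """
    dist = {}
    for f in product(inputs, repeat=len(histories)):
        prob = 1.0
        for h, a in zip(histories, f):
            prob *= q[h][a]
        dist[f] = prob
    return dist

def induced_input_probability(dist, histories, h, a):
    """Total probability of the code-functions that map history h to input a."""
    idx = histories.index(h)
    return sum(p for f, p in dist.items() if f[idx] == a)

# Illustrative example for t = 2 with binary feedback and binary input.
histories = [(0,), (1,)]
inputs = [0, 1]
q = {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.2, 1: 0.8}}
dist = code_function_distribution(q, histories, inputs)
p_induced = induced_input_probability(dist, histories, (1,), 1)
```

The four code-functions receive probabilities that sum to one, and the probability that the chosen code-function maps history $(1,)$ to input 1 equals $q(1 \mid 1) = 0.8$, which is the single-step version of the consistency claim above.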

In summary, if the sufficient statistic $\Pi$ is computable from the channel output then the optimal code-function distribution can be taken to be a product measure. If $\Pi_t$ depends on $A^{t-1}$ then the optimal code-function distribution, in general, will not be a product measure.

In this section we discussed scenarios where the sufficient statistics had special structure. 
One open question is to determine whether the sufficient statistics will simplify if we restrict 
ourselves to special classes of code-functions. As an example, it was shown in ^3] that for 
non-ISI Markov channels with no feedback one can find a single-letter formula when the
channel inputs are independent and identically distributed.

9 Maximum Likelihood Decoding

We now consider the problem of maximum-likelihood decoding. For a given message set $\mathcal{W}$ fix a channel code $\{f^T[w] : w \in \mathcal{W}\}$. Assume the messages are chosen uniformly. Hence each channel code-function is chosen with probability $P_{F^T}(f^T[w]) = \frac{1}{|\mathcal{W}|}$. For a consistent joint measure $Q(df^T, da^T, db^T)$ our task is to simplify the computation of $\arg\max_{w \in \mathcal{W}} Q(f^T[w] \mid b^T)$. First consider the general channels described in section 5. Note

Q{fy) = Qif,a^ = fib^-')y) 

= Qif I = f{b^-'), h") Q{a^ = f{b^-'), h") (40) 
Also, if Q(a^,5^) > then 



p^T(np-(b^ia^)nL%(6^-i)}K) 

p{bT I a^)nLQ(at I a*-i,5*-i) 

(a) PpTif) 

PFT{TT{bT~i,aT)) 

1 



\TT{bT-\aT)\ 



(41) 



where (a) follows by lemma 5.1. Note this implies that F'^ — — is not a Markov 
chain under Q. 

Due to the feedback we effectively have a different channel code without feedback for 
each Specifically, for each b'^^^ define 



A(6^ = {a' : = r H(^ ~ ) for some w € W}. 

From equations (40) and (41) we see that computing argmax^gw Q{f'^[w\ \ b^) is equivalent 
to computing: 

I ^ 

|T^(b^-i,a^)| n^^^* I '3(ai I a* ') (42) 

where {Q{dat \ a*~^,6*~^)} is the induced channel input distribution for PpT{f'^[w\). 

For the Markov channel case we may replace p{dbt \ a*,b^~^) with p{dbt \ irtjat) in 
equation (42). If in addition, the channel code is chosen such that the induced channel 
input distribution has the form {q{dat \ 71^,7^)} then: 



1 ^ 



(43) 



In the case where $|\Upsilon^T(b^{T-1}, a^T)| = 1$, $\forall a^T \in \mathcal{A}(b^{T-1})$, the optimization in (43) can be treated as a deterministic longest path problem.

10 Conclusion



We have presented a general framework for treating channels with memory and feedback. 
We first proved a general coding theorem based on Massey's concept of directed information 
and Dobrushin's program of communication as interconnection. We then specialized this 
result to the case of Markov channels. To compute the capacity of these Markov channels 
we converted the directed information optimization problem into a partially observed MDP. 
This required identifying appropriate sufficient statistics at the encoder and decoder. The 
ACOE verification theorem was presented and sufficient conditions for the existence of a 
solution were provided. The complexity of many feedback problems can now be understood 
by examining the complexity of the associated ACOE. 

The framework developed herein leaves open the possibility of using approximate dy- 
namic programming techniques, like value and policy iteration and reinforcement learning, 
for computing the capacity. In addition the framework allows one to compute the capacity 
under restricted policies. This is useful if one is willing to sacrifice capacity for the benefit 
of a simpler policy. 

Acknowledgements: The authors would like to thank Vivek Borkar for many helpful 
discussions. 

A Appendix 

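A.1 Review of Stochastic Kernels

The following results are standard and can be found in, for example, [1]. Let $(\mathcal{V}, \mathcal{B}_V)$ be a Borel space and let $(\mathcal{X}, \mathcal{B}_X)$ and $(\mathcal{Y}, \mathcal{B}_Y)$ be Polish spaces equipped with their Borel $\sigma$-algebras.

Definition A.1 Let $\tau(dx \mid v)$ be a family of probability measures on $\mathcal{X}$ parameterized by $v \in \mathcal{V}$. We say that $\tau$ is a stochastic kernel from $\mathcal{V}$ to $\mathcal{X}$ if for every Borel set $B \in \mathcal{B}_X$, the function $v \mapsto \tau(B \mid v) \in [0, 1]$ is measurable.

Lemma A.1 For $B \in \mathcal{B}_X$, define $f_B : \mathcal{P}(\mathcal{X}) \to [0, 1]$ by $f_B : \xi \mapsto \xi(B)$ for $\xi \in \mathcal{P}(\mathcal{X})$. Then
$$\mathcal{B}_{\mathcal{P}(\mathcal{X})} = \sigma\Big[\bigcup_{B \in \mathcal{B}_X} f_B^{-1}(\mathcal{B}_{\mathbb{R}})\Big].$$

Theorem A.1 Let $\tau(dx \mid v)$ be a family of probability measures on $\mathcal{X}$ given $\mathcal{V}$. Then $\tau(dx \mid v)$ is a stochastic kernel if and only if $v \in \mathcal{V} \mapsto \tau(\cdot \mid v) \in \mathcal{P}(\mathcal{X})$ is measurable. That is, if and only if $\tau(\cdot \mid v)$ is a random variable from $\mathcal{V}$ into $\mathcal{P}(\mathcal{X})$.

Since $\tau(\cdot \mid v)$ is a random variable from $\mathcal{V}$ into $\mathcal{P}(\mathcal{X})$ it follows that the class of stochastic kernels is closed under weak limits (weak topology on the space of probability measures).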

We now discuss interconnections of stochastic kernels. Let $\tau_1(dx \mid v)$ be a stochastic kernel from $\mathcal{V}$ to $\mathcal{X}$ and $\tau_2(dy \mid v, x)$ be a stochastic kernel from $\mathcal{V} \times \mathcal{X}$ to $\mathcal{Y}$. Then the joint stochastic kernel $\tau_1 \otimes \tau_2$ from $\mathcal{V}$ to $\mathcal{X} \times \mathcal{Y}$ is defined as follows: for all $v \in \mathcal{V}$, $A \in \mathcal{B}_X$, and $B \in \mathcal{B}_Y$ we have
$$\tau_1 \otimes \tau_2(A \times B \mid v) = \int_{A \times B} \tau_2(dy \mid v, x)\, \tau_1(dx \mid v) = \int_A \tau_2(B \mid v, x)\, \tau_1(dx \mid v).$$

Via the Ionescu-Tulcea theorem this can be generalized to interconnections of a countable number of stochastic kernels.
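On finite spaces the interconnection is just a product of probability tables. The following minimal sketch illustrates this under the assumption that `tau1[v][x]` and `tau2[(v, x)][y]` are finite tables; the function name `interconnect` and the example values are hypothetical.

```python
def interconnect(tau1, tau2):
    """Joint kernel (tau1 (x) tau2)(x, y | v) = tau2(y | v, x) * tau1(x | v).

    tau1[v][x] and tau2[(v, x)][y] are finite probability tables.
    """
    joint = {}
    for v, px in tau1.items():
        joint[v] = {(x, y): p_x * p_y
                    for x, p_x in px.items()
                    for y, p_y in tau2[(v, x)].items()}
    return joint

# Illustrative example with a single parameter value "v".
tau1 = {"v": {0: 0.25, 1: 0.75}}
tau2 = {("v", 0): {"y0": 1.0}, ("v", 1): {"y0": 0.5, "y1": 0.5}}
joint = interconnect(tau1, tau2)
```

Summing the joint table over $y$ recovers $\tau_1$, which is the finite counterpart of the marginal property used in the decomposition theorems that follow.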

We now discuss the decompositions of measures. 

Theorem A. 2 LetX{dxSdy) be a probability measure on {JY x y,Bx ® By). Let Xi{A) = 
X{A,y), A G Bx be the first marginal. Then there exists a stochastic kernel X{dy \ x) on y 
given X such that for all A G Bx and B G By we have 

X{A xB)= j Xi{dx)X{dy \ x) = [ X{B \ x)Ai(dx) 
Jaxb Ja 

This can be generalized to a parametric dependence: 

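Theorem A.3 Let $\lambda(dx \otimes dy \mid v)$ be a stochastic kernel on $\mathcal{X} \times \mathcal{Y}$ given $\mathcal{V}$. Let $\lambda_1(A \mid v)$
be the first marginal, which is a stochastic kernel on $\mathcal{X}$ given $\mathcal{V}$ defined by

\[ \lambda_1(A \mid v) = \lambda(A, \mathcal{Y} \mid v), \quad A \in \mathcal{B}_{\mathcal{X}},\ v \in \mathcal{V}. \]

Then there exists a stochastic kernel $\lambda(dy \mid v, x)$ on $\mathcal{Y}$ given $\mathcal{V} \times \mathcal{X}$ such that $\forall v \in \mathcal{V}$,
$A \in \mathcal{B}_{\mathcal{X}}$, and $B \in \mathcal{B}_{\mathcal{Y}}$ we have

\[ \lambda(A \times B \mid v) = \int_{A \times B} \lambda_1(dx \mid v)\, \lambda(dy \mid v, x) = \int_{A} \lambda(B \mid v, x)\, \lambda_1(dx \mid v). \]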

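Let $\lambda(dx \otimes dy \mid v)$ be a stochastic kernel on $\mathcal{X} \times \mathcal{Y}$ given $\mathcal{V}$ and suppose the stochastic
kernel $\tau(dy \mid v, x)$ on $\mathcal{Y}$ given $\mathcal{V} \times \mathcal{X}$ satisfies:

$\forall v \in \mathcal{V}$, $\forall B \in \mathcal{B}_{\mathcal{Y}}$ we have $\lambda(B \mid v, x) = \tau(B \mid v, x)$ for $\lambda_1(dx \mid v)$-almost all $x$.

Then for any measurable function $g : \mathcal{V} \times \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ and all $v \in \mathcal{V}$ we have

\[ E(g(V, X, Y) \mid v) = \int_{\mathcal{X} \times \mathcal{Y}} g(v, x, y)\, \tau(dy \mid v, x)\, \lambda_1(dx \mid v) \]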

whenever the conditional expectation on the left hand side exists. 

Finally, recall that a stochastic kernel $\tau(dy \mid x)$ on $\mathcal{Y}$ given $\mathcal{X}$ is continuous if for all
continuous bounded functions $v$ on $\mathcal{Y}$ the function $\int v(y)\, \tau(dy \mid x)$ is a continuous and
bounded function on $\mathcal{X}$.

Theorem A.4 If $\tau(dy \mid x)$ is a continuous stochastic kernel on $\mathcal{Y}$ given $\mathcal{X}$ and $v(x, y)$ is
a continuous bounded function on $\mathcal{X} \times \mathcal{Y}$ then $\int v(x, y)\, \tau(dy \mid x)$ is a continuous bounded
function on $\mathcal{X}$.

A.2 Lemma 4.1



The proof of Lemma 4.1 below is adapted from Lemma A1 of [...] and Theorem 8 of [...]. We
need the following three lemmas. Combined they state that the mass of $i(A^T; B^T)$ at the
tails is small. Recall that $|\mathcal{B}| < \infty$.

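Lemma A.2 Let $L > \log |\mathcal{B}|$. For any sequence of measures $\{P_{B^T}\}_{T=1}^{\infty}$ we have

\[ \lim_{T \to \infty} E\left[ \frac{1}{T} \log \frac{1}{P(B^T)}\; 1\Big\{ \frac{1}{T} \log \frac{1}{P(B^T)} \ge L \Big\} \right] = 0. \]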



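Proof: Let $\Omega = \{ b^T : P(b^T) \le 2^{-TL} \}$. Now

\begin{align*}
E\left[ \frac{1}{T} \log \frac{1}{P(B^T)}\; 1\Big\{ \frac{1}{T} \log \frac{1}{P(B^T)} \ge L \Big\} \right]
&= \frac{1}{T} \sum_{b^T \in \Omega} P(b^T) \log \frac{1}{P(b^T)} \\
&\le \frac{1}{T} P(\Omega) \log |\mathcal{B}^T| - \frac{1}{T} P(\Omega) \log P(\Omega) \\
&\le \frac{1}{T} P(\Omega) \log |\mathcal{B}^T| + \frac{\log e}{eT}
\end{align*}

where the first inequality follows because entropy is maximized by the uniform distribution
and the second inequality follows because $-x \log x \le \frac{\log e}{e}$ for $0 \le x \le 1$. Now $P(\Omega) \le |\Omega|\, 2^{-TL} \le |\mathcal{B}^T|\, 2^{-TL}$. Thus

\[ E\left[ \frac{1}{T} \log \frac{1}{P(B^T)}\; 1\Big\{ \frac{1}{T} \log \frac{1}{P(B^T)} \ge L \Big\} \right] \le \log |\mathcal{B}|\; 2^{-T(L - \log |\mathcal{B}|)} + \frac{\log e}{eT}. \]

This upper bound goes to zero as $T \to \infty$. □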
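
The tail bound in Lemma A.2 can be checked numerically in a simple special case. The sketch below assumes a binary alphabet and an i.i.d. Bernoulli($p$) sequence, which is only one of the sequences of measures the lemma covers. It uses base-2 logarithms and the constant $\frac{\log e}{e}$, the maximum of $-x \log x$ on $[0, 1]$. The function names and parameter values are hypothetical.

```python
from math import comb, e, log2

def tail_expectation(T, p, L):
    """E[(1/T) log2(1/P(B^T)) ; (1/T) log2(1/P(B^T)) >= L] for B^T i.i.d. Bernoulli(p)."""
    total = 0.0
    for k in range(T + 1):  # k = number of ones in the sequence b^T
        log_prob = k * log2(p) + (T - k) * log2(1 - p)  # log2 P(b^T) for any such b^T
        rate = -log_prob / T
        if rate >= L:
            # comb(T, k) sequences share this probability and this rate.
            total += comb(T, k) * (2.0 ** log_prob) * rate
    return total

def lemma_bound(T, L, alphabet_size=2):
    """Upper bound log|B| * 2^{-T(L - log|B|)} + (log e) / (e T), in bits."""
    log_b = log2(alphabet_size)
    return log_b * 2.0 ** (-T * (L - log_b)) + log2(e) / (e * T)

# Hypothetical parameters: p = 0.05 and L = 1.5 > log2(2) = 1.
results = [(T, tail_expectation(T, 0.05, 1.5), lemma_bound(T, 1.5)) for T in (1, 10, 50, 100)]
```

Each tuple in `results` holds the block length, the tail expectation, and the bound; both quantities shrink as $T$ grows.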
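
Lemma A.3 For any sequence of joint measures $\{P_{A^T, B^T}\}_{T=1}^{\infty}$ we have

\[ \lim_{T \to \infty} E\left[ \frac{1}{T}\, i(A^T; B^T)\; 1\Big\{ \frac{1}{T}\, i(A^T; B^T) \le 0 \Big\} \right] = 0. \]

Proof: Follows from page 10 of Pinsker [25].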

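Lemma A.4 Let $L > \log |\mathcal{B}|$. For any sequence of joint measures $\{P_{A^T, B^T}\}_{T=1}^{\infty}$ we have

\[ \lim_{T \to \infty} E\left[ \frac{1}{T}\, i(A^T; B^T)\; 1\Big\{ \frac{1}{T}\, i(A^T; B^T) \ge L \Big\} \right] = 0. \]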



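Proof: Let $\Omega = \{ b^T : P(b^T) \le 2^{-TL} \}$. Note that $\frac{1}{P(B^T)} \ge \frac{P(B^T \mid A^T)}{P(B^T)}$, $P_{A^T, B^T}$-a.s. Now

\begin{align*}
E\left[ \frac{1}{T}\, i(A^T; B^T)\; 1\Big\{ \frac{1}{T}\, i(A^T; B^T) \ge L \Big\} \right]
&= E\left[ \frac{1}{T} \log \frac{P(B^T \mid A^T)}{P(B^T)}\; 1\Big\{ \frac{1}{T} \log \frac{P(B^T \mid A^T)}{P(B^T)} \ge L \Big\} \right] \\
&\le E\left[ \frac{1}{T} \log \frac{1}{P(B^T)}\; 1\Big\{ \frac{1}{T} \log \frac{1}{P(B^T)} \ge L \Big\} \right] \\
&\le \log |\mathcal{B}|\; 2^{-T(L - \log |\mathcal{B}|)} + \frac{\log e}{eT}.
\end{align*}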
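The last inequality follows from Lemma A.2. This upper bound goes to zero as $T \to \infty$. □

Proof of Lemma 4.2: The second inequality is obvious. To prove the first inequality note that
$\forall \epsilon > 0$ we have

\begin{align*}
\frac{1}{T} I(A^T \to B^T) \ge{}& E\left[ \frac{1}{T} \log \frac{P(B^T \mid A^T)}{P(B^T)}\; 1\Big\{ \frac{1}{T} \log \frac{P(B^T \mid A^T)}{P(B^T)} \le 0 \Big\} \right] \\
&+ 0 \times P\left( 0 < \frac{1}{T} \log \frac{P(B^T \mid A^T)}{P(B^T)} \le \underline{I}(A \to B) - \epsilon \right) \\
&+ \big( \underline{I}(A \to B) - \epsilon \big)\, P\left( \frac{1}{T} \log \frac{P(B^T \mid A^T)}{P(B^T)} > \underline{I}(A \to B) - \epsilon \right).
\end{align*}

The first addend goes to zero by Lemma A.3, the second addend equals zero, and the
probability in the last addend goes to 1. Thus for $T$ large enough $\frac{1}{T} I(A^T \to B^T) \ge \underline{I}(A \to B) - 2\epsilon$.
Since $\epsilon$ is arbitrary we see that $\underline{I}(A \to B) \le \liminf_{T \to \infty} \frac{1}{T} I(A^T \to B^T)$.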
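Now we treat the last inequality. For any $\epsilon > 0$ and any $L > \log |\mathcal{B}|$ we have

\begin{align*}
\frac{1}{T} I(A^T \to B^T) \le{}& E\left[ \frac{1}{T} \log \frac{P(B^T \mid A^T)}{P(B^T)}\; 1\Big\{ \frac{1}{T} \log \frac{P(B^T \mid A^T)}{P(B^T)} \ge L \Big\} \right] \\
&+ L\, P\left( \overline{I}(A \to B) + \epsilon < \frac{1}{T} \log \frac{P(B^T \mid A^T)}{P(B^T)} < L \right) \\
&+ \big( \overline{I}(A \to B) + \epsilon \big)\, P\left( \frac{1}{T} \log \frac{P(B^T \mid A^T)}{P(B^T)} \le \overline{I}(A \to B) + \epsilon \right).
\end{align*}

The first addend goes to zero by Lemma A.4, the second addend goes to zero by definition of
$\overline{I}$, and the probability in the last addend goes to 1. Thus for $T$ large enough $\frac{1}{T} I(A^T \to B^T)
\le \overline{I}(A \to B) + 2\epsilon$. Since $\epsilon$ is arbitrary we see that $\limsup_{T \to \infty} \frac{1}{T} I(A^T \to B^T) \le \overline{I}(A \to B)$. □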



A.3 Lemma 7.7

We repeat the statement of Lemma 7.7 for convenience.

Lemma 7.7 For $|\mathcal{B}|$ finite we have

(1) The cost is bounded and continuous. Specifically, $0 \le c(u) \le \log |\mathcal{B}|$, $\forall u \in \mathcal{U}$.

(2) The control constraint function $U(\gamma)$ is a continuous set-valued map between $\mathcal{P}(\mathcal{P}(\mathcal{S}))$
and $\mathcal{U}$.

(3) The dynamics $r(d\gamma_{t+1} \mid \gamma_t, u_t)$ are continuous.

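Proof: To prove part (1) recall

\[ c(u) = \int r(db \mid \pi, a)\, u(d\pi, da) \log \frac{r(b \mid \pi, a)}{\int r(b \mid \tilde{\pi}, \tilde{a})\, u(d\tilde{\pi}, d\tilde{a})}. \]

Thus $c(u)$ corresponds to a mutual information with input distribution $u(d\pi, da)$ and an output
in a finite alphabet $\mathcal{B}$. Hence $c(u) \le \log |\mathcal{B}|$ $\forall u \in \mathcal{U}$. The cost is clearly continuous in
$u \in \mathcal{U}$.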
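
The bound in part (1) can be illustrated with a small computation. The sketch below assumes that the input distribution $u$ is supported on finitely many atoms $(\pi, a)$, indexed by `i`, so that the integrals become sums; the function name `mutual_information` and the numbers are hypothetical, and logarithms are base 2.

```python
from math import log2

def mutual_information(u, r):
    """Mutual information in bits for input pmf u[i] and channel rows r[i][b]."""
    n_out = len(r[0])
    # Output marginal: sum over atoms i of r(b | i) * u(i).
    out = [sum(u[i] * r[i][b] for i in range(len(u))) for b in range(n_out)]
    total = 0.0
    for i, u_i in enumerate(u):
        for b in range(n_out):
            joint = u_i * r[i][b]
            if joint > 0:  # zero-probability terms contribute nothing
                total += joint * log2(r[i][b] / out[b])
    return total

# Hypothetical input with three atoms and a binary output alphabet, so log2|B| = 1.
u = [0.5, 0.25, 0.25]
r = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
cost = mutual_information(u, r)

# A noiseless channel with a uniform input attains the upper bound log2|B|.
cost_max = mutual_information([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])
```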

To prove part (2) recall $U(\gamma) = \{ u(d\pi, da) : u(d\pi, da) \in \mathcal{U},\ u(d\pi) = \gamma(d\pi) \}$. The set
$U(\gamma)$ is compact for each $\gamma \in \mathcal{P}(\mathcal{P}(\mathcal{S}))$. For any set $H \subset \mathcal{U}$ denote $U^{-1}(H) = \{ \gamma :
U(\gamma) \cap H \ne \emptyset \}$. The set-valued map $U(\gamma)$ is continuous if it is both

(1) Upper semicontinuous (usc): $U^{-1}(F)$ is closed in $\mathcal{P}(\mathcal{P}(\mathcal{S}))$ for every closed set $F \subset \mathcal{U}$.

(2) Lower semicontinuous (lsc): $U^{-1}(G)$ is open in $\mathcal{P}(\mathcal{P}(\mathcal{S}))$ for every open set $G \subset \mathcal{U}$.

The control constraint $U(\gamma)$ is clearly both usc and lsc and hence is continuous.
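To prove part (3) recall equation (33):

\[ r(d\gamma_{t+1} \mid \gamma_t, u_t) = \int_{\mathcal{P}(\mathcal{S})} \int_{\mathcal{A}} \int_{\mathcal{B}} \delta_{\{\Phi_{\Gamma}(u_t, b_t)\}}(d\gamma_{t+1})\; r(db_t \mid \pi_t, a_t)\; u_t(d\pi_t, da_t). \]

Since this stochastic kernel does not depend on $\gamma_t$ we only need to show that it is continuous
in $u_t$. Specifically, let $v$ be any continuous bounded function on $\mathcal{P}(\mathcal{P}(\mathcal{S}))$. We need to show

\[ \int v\big( \Phi_{\Gamma}(u, b) \big)\; r(db \mid \pi, a)\; u(d\pi, da) \tag{A1} \]

is a continuous function of $u$.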

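By equation (25) we know for all Borel measurable $\Omega \subset \mathcal{P}(\mathcal{S})$:

\[ \gamma[u, b](\Omega) = \int\!\!\int 1\{ \Phi_{\Pi}(\pi, a, b) \in \Omega \}\; r(d\pi, da \mid u, b). \tag{A2} \]

By Lemma 7.1 we know for any Borel measurable $\Omega \subset \mathcal{P}(\mathcal{S})$, $a$, $b$, and $u$ we have

\[ r(\Omega, a \mid u, b) = \frac{\int_{\Omega} r(b \mid \pi, a)\; u(d\pi, a)}{\int\!\!\int r(b \mid \tilde{\pi}, \tilde{a})\; u(d\tilde{\pi}, d\tilde{a})} \tag{A3} \]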

when the denominator does not equal zero. Because $\mathcal{B}$ is finite and by repeated use of
Theorem A.4 we see that (A3) is continuous in $u, b$ for all $\Omega$. This implies (A2) is continuous
in $u, b$ for all $\Omega$, which in turn implies that (A1) is continuous in $u$. □
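
The posterior in (A3) is an instance of Bayes' rule. The sketch below shows the update when $u$ is supported on finitely many atoms $(\pi, a)$, which is an assumption made for illustration; the name `posterior` and the numbers are hypothetical. When the denominator is zero the posterior is undefined, and the function returns `None` in that case.

```python
def posterior(u, r, b):
    """Posterior over the atoms (pi, a) of u after observing output b.

    u[i] is the prior mass of atom i and r[i][b] the probability of output b
    given atom i. Returns None when output b has zero probability under u.
    """
    weights = [u_i * r_i[b] for u_i, r_i in zip(u, r)]
    norm = sum(weights)  # denominator of (A3)
    if norm == 0:
        return None
    return [w / norm for w in weights]

# Hypothetical prior on two atoms and a binary output alphabet.
u = [0.5, 0.5]
r = [[0.75, 0.25], [0.25, 0.75]]
post = posterior(u, r, 0)

# An output that no atom can produce has no well-defined posterior.
undefined = posterior([1.0, 0.0], [[1.0, 0.0], [0.5, 0.5]], 1)
```

Away from a zero denominator the output of `posterior` is a ratio of sums that are linear in `u`, which is the finite analogue of the continuity in $u$ used in the proof.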